SPRINGER TEXTS IN STATISTICS 


Mathematical 
Statistics 


Jun Shao 


Springer


Springer Texts in Statistics 


Advisors: 
George Casella Stephen Fienberg Ingram Olkin 


Springer Texts in Statistics 


Alfred: Elements of Statistics for the Life and Social Sciences 

Berger: An Introduction to Probability and Stochastic Processes 

Bilodeau and Brenner: Theory of Multivariate Statistics 

Blom: Probability and Statistics: Theory and Applications 

Brockwell and Davis: Introduction to Time Series and Forecasting, Second
Edition 

Chow and Teicher: Probability Theory: Independence, Interchangeability, 
Martingales, Third Edition 

Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial 
Data: Nonparametric Regression and Response Surface Maximization, Second 
Edition 

Christensen: Log-Linear Models and Logistic Regression, Second Edition 

Christensen: Plane Answers to Complex Questions: The Theory of Linear 
Models, Third Edition 

Creighton: A First Course in Probability Models and Statistical Inference 

Davis: Statistical Methods for the Analysis of Repeated Measurements 

Dean and Voss: Design and Analysis of Experiments 

du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis 

Durrett: Essentials of Stochastic Processes 

Edwards: Introduction to Graphical Modelling, Second Edition 

Finkelstein and Levin: Statistics for Lawyers 

Flury: A First Course in Multivariate Statistics 

Jobson: Applied Multivariate Data Analysis, Volume I: Regression and 
Experimental Design 

Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and 
Multivariate Methods 

Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second 
Edition 

Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical 
Inference, Second Edition 

Karr: Probability 

Keyfitz: Applied Mathematical Demography, Second Edition 

Kiefer: Introduction to Statistical Inference 

Kokoska and Nevison: Statistical Tables and Formulae 

Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems 

Lange: Applied Probability 

Lehmann: Elements of Large-Sample Theory 

Lehmann: Testing Statistical Hypotheses, Second Edition 

Lehmann and Casella: Theory of Point Estimation, Second Edition 

Lindman: Analysis of Variance in Experimental Design 

Lindsey: Applying Generalized Linear Models 


(continued after index) 


Jun Shao 


Mathematical Statistics 


Second Edition 


Springer


Jun Shao 

Department of Statistics 
University of Wisconsin, Madison 
Madison, WI 53706-1685 

USA 

shao@stat.wisc.edu 


Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA


With 7 figures.


Library of Congress Cataloging-in-Publication Data
Shao, Jun.
Mathematical statistics / Jun Shao.—2nd ed.
p. cm.— (Springer texts in statistics)
Includes bibliographical references and index.
ISBN 0-387-95382-5 (alk. paper)
1. Mathematical statistics. I. Title. II. Series.
QA276.S458 2003
519.5—dc21 2003045446


ISBN 0-387-95382-5 Printed on acid-free paper.
ISBN-13 978-0-387-95382-3


© 2003 Springer Science+Business Media, LLC.


All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, LLC., 233 Spring St.,
New York, N.Y., 10013, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or
not they are subject to proprietary rights.

Printed in the United States of America.

9 8 7 6 5 4 (corrected printing as of 4th printing, 2007)


springer.com 


To Guang, Jason, and Annie 


Preface to the First Edition


This book is intended for a course entitled Mathematical Statistics offered 
at the Department of Statistics, University of Wisconsin-Madison. This 
course, taught in a mathematically rigorous fashion, covers essential ma- 
terials in statistical theory that a first or second year graduate student 
typically needs to learn as preparation for work on a Ph.D. degree in statis- 
tics. The course is designed for two 15-week semesters, with three lecture 
hours and two discussion hours in each week. Students in this course are 
assumed to have a good knowledge of advanced calculus. A course in real 
analysis or measure theory prior to this course is often recommended. 


Chapter 1 provides a quick overview of important concepts and results 
in measure-theoretic probability theory that are used as tools in math- 
ematical statistics. Chapter 2 introduces some fundamental concepts in 
statistics, including statistical models, the principle of sufficiency in data 
reduction, and two statistical approaches adopted throughout the book: 
statistical decision theory and statistical inference. Each of Chapters 3 
through 7 provides a detailed study of an important topic in statistical de- 
cision theory and inference: Chapter 3 introduces the theory of unbiased 
estimation; Chapter 4 studies theory and methods in point estimation un- 
der parametric models; Chapter 5 covers point estimation in nonparametric 
settings; Chapter 6 focuses on hypothesis testing; and Chapter 7 discusses 
interval estimation and confidence sets. The classical frequentist approach 
is adopted in this book, although the Bayesian approach is also introduced 
(§2.3.2, §4.1, §6.4.4, and §7.1.3). Asymptotic (large sample) theory, a crucial
part of statistical inference, is studied throughout the book, rather than
in a separate chapter. 

About 85% of the book covers classical results in statistical theory that 
are typically found in textbooks of a similar level. These materials are in the 
Statistics Department’s Ph.D. qualifying examination syllabus. This part 
of the book is influenced by several standard textbooks, such as Casella and
Berger (1990), Ferguson (1967), Lehmann (1983, 1986), and Rohatgi (1976).
The other 15% of the book covers some topics in modern statistical theory 
that have been developed in recent years, including robustness of the least 
squares estimators, Markov chain Monte Carlo, generalized linear models, 
quasi-likelihoods, empirical likelihoods, statistical functionals, generalized 
estimation equations, the jackknife, and the bootstrap. 


In addition to the presentation of fruitful ideas and results, this book 
emphasizes the use of important tools in establishing theoretical results. 
Thus, most proofs of theorems, propositions, and lemmas are provided 
or left as exercises. Some proofs of theorems are omitted (especially in 
Chapter 1), because the proofs are lengthy or beyond the scope of the 
book (references are always provided). Each chapter contains a number of 
examples. Some of them are designed as materials covered in the discussion 
section of this course, which is typically taught by a teaching assistant (a 
senior graduate student). The exercises in each chapter form an important 
part of the book. They provide not only practice problems for students, 
but also many additional results as complementary materials to the main 
text. 


The book is essentially based on (1) my class notes taken in 1983-84 
when I was a student in this course, (2) the notes I used when I was a 
teaching assistant for this course in 1984-85, and (3) the lecture notes I 
prepared during 1997-98 as the instructor of this course. I would like to 
express my thanks to Dennis Cox, who taught this course when I was 
a student and a teaching assistant, and undoubtedly has influenced my 
teaching style and textbook for this course. I am also very grateful to 
students in my class who provided helpful comments; to Mr. Yonghee Lee, 
who helped me to prepare all the figures in this book; to the Springer-Verlag 
production and copy editors, who helped to improve the presentation; and 
to my family members, who provided support during the writing of this 
book. 


Madison, Wisconsin Jun Shao 
January 1999 


Preface to the Second Edition


In addition to correcting typos and errors and making a better presentation, 
the main effort in preparing this new edition is adding some new material 
to Chapter 1 (Probability Theory) and a number of new exercises to each 
chapter. Furthermore, two new sections are created to introduce semipara- 
metric models and methods (§5.1.4) and to study the asymptotic accuracy 
of confidence sets (§7.3.4). The structure of the book remains the same. 


In Chapter 1 of the new edition, moment generating and characteristic 
functions are treated in more detail and a proof of the uniqueness theorem 
is provided; some useful moment inequalities are introduced; discussions 
on conditional independence, Markov chains, and martingales are added, 
as a continuation of the discussion of conditional expectations; the con- 
cepts of weak convergence and tightness are introduced; proofs to some key 
results in asymptotic theory, such as the dominated convergence theorem 
and monotone convergence theorem, the Lévy-Cramér continuity theorem, 
the strong and weak laws of large numbers, and Lindeberg’s central limit 
theorem, are included; and a new section (§1.5.6) is created to introduce 
Edgeworth and Cornish-Fisher expansions. As a result, Chapter 1 of the 
new edition is self-contained for important concepts, results, and proofs in 
probability theory with emphasis on statistical applications.


Since the original book was published in 1999, I have been using it as 
a textbook for a two-semester course in mathematical statistics. Exercise 
problems accumulated during my teaching are added to this new edition. 
Some exercises that are too trivial have been removed. 


In the original book, indices on definitions, examples, theorems, propo- 
sitions, corollaries, and lemmas are included in the subject index. In the 
new edition, they are in a separate index given at the end of the book (prior
to the author index). A list of notation and a list of abbreviations, which 
are appendices of the original book, are given after the references. 



The most significant change in notation is the notation for a vector.
In the text of the new edition, a $k$-dimensional vector is denoted by
$c = (c_1, ..., c_k)$, whether it is treated as a column or a row vector (which is not
important if matrix algebra is not considered). When matrix algebra is
involved, any vector $c$ is treated as a $k \times 1$ matrix (a column vector) and
its transpose $c'$ is treated as a $1 \times k$ matrix (a row vector). Thus, for
$c = (c_1, ..., c_k)$, $c'c = c_1^2 + \cdots + c_k^2$ and $cc'$ is the $k \times k$ matrix
whose $(i, j)$th element is $c_i c_j$.

I would like to thank reviewers of this book for their constructive com- 
ments, the Springer-Verlag production and copy editors, students in my 
classes, and two teaching assistants, Mr. Bin Cheng and Dr. Hansheng 
Wang, who provided help in preparing the new edition. Any remaining 
errors are of course my own responsibility, and a correction of them may 
be found on my web page http://www.stat.wisc.edu/~shao.


Madison, Wisconsin Jun Shao 
April, 2003 


Contents


Preface to the First Edition
Preface to the Second Edition


Chapter 1. Probability Theory

1.1 Probability Spaces and Random Elements
1.1.1 $\sigma$-fields and measures
1.1.2 Measurable functions and distributions

1.2 Integration and Differentiation
1.2.1 Integration
1.2.2 Radon-Nikodym derivative

1.3 Distributions and Their Characteristics
1.3.1 Distributions and probability densities
1.3.2 Moments and moment inequalities
1.3.3 Moment generating and characteristic functions

1.4 Conditional Expectations
1.4.1 Conditional expectations
1.4.2 Independence
1.4.3 Conditional distributions
1.4.4 Markov chains and martingales

1.5 Asymptotic Theory
1.5.1 Convergence modes and stochastic orders
1.5.2 Weak convergence
1.5.3 Convergence of transformations
1.5.4 The law of large numbers
1.5.5 The central limit theorem
1.5.6 Edgeworth and Cornish-Fisher expansions

1.6 Exercises


Chapter 2. Fundamentals of Statistics

2.1 Populations, Samples, and Models
2.1.1 Populations and samples
2.1.2 Parametric and nonparametric models
2.1.3 Exponential and location-scale families

2.2 Statistics, Sufficiency, and Completeness
2.2.1 Statistics and their distributions
2.2.2 Sufficiency and minimal sufficiency
2.2.3 Complete statistics

2.3 Statistical Decision Theory
2.3.1 Decision rules, loss functions, and risks
2.3.2 Admissibility and optimality

2.4 Statistical Inference
2.4.1 Point estimators
2.4.2 Hypothesis tests
2.4.3 Confidence sets

2.5 Asymptotic Criteria and Inference
2.5.1 Consistency
2.5.2 Asymptotic bias, variance, and mse
2.5.3 Asymptotic inference

2.6 Exercises


Chapter 3. Unbiased Estimation

3.1 The UMVUE
3.1.1 Sufficient and complete statistics
3.1.2 A necessary and sufficient condition
3.1.3 Information inequality
3.1.4 Asymptotic properties of UMVUE’s

3.2 U-Statistics
3.2.1 Some examples
3.2.2 Variances of U-statistics
3.2.3 The projection method

3.3 The LSE in Linear Models
3.3.1 The LSE and estimability
3.3.2 The UMVUE and BLUE
3.3.3 Robustness of LSE’s
3.3.4 Asymptotic properties of LSE’s

3.4 Unbiased Estimators in Survey Problems
3.4.1 UMVUE’s of population totals
3.4.2 Horvitz-Thompson estimators

3.5 Asymptotically Unbiased Estimators
3.5.1 Functions of unbiased estimators
3.5.2 The method of moments
3.5.3 V-statistics
3.5.4 The weighted LSE

3.6 Exercises


Chapter 4. Estimation in Parametric Models

4.1 Bayes Decisions and Estimators
4.1.1 Bayes actions
4.1.2 Empirical and hierarchical Bayes methods
4.1.3 Bayes rules and estimators
4.1.4 Markov chain Monte Carlo

4.2 Invariance
4.2.1 One-parameter location families
4.2.2 One-parameter scale families
4.2.3 General location-scale families

4.3 Minimaxity and Admissibility
4.3.1 Estimators with constant risks
4.3.2 Results in one-parameter exponential families
4.3.3 Simultaneous estimation and shrinkage estimators

4.4 The Method of Maximum Likelihood
4.4.1 The likelihood function and MLE’s
4.4.2 MLE’s in generalized linear models
4.4.3 Quasi-likelihoods and conditional likelihoods

4.5 Asymptotically Efficient Estimation
4.5.1 Asymptotic optimality
4.5.2 Asymptotic efficiency of MLE’s and RLE’s
4.5.3 Other asymptotically efficient estimators

4.6 Exercises


Chapter 5. Estimation in Nonparametric Models

5.1 Distribution Estimators
5.1.1 Empirical c.d.f.’s in i.i.d. cases
5.1.2 Empirical likelihoods
5.1.3 Density estimation
5.1.4 Semi-parametric methods

5.2 Statistical Functionals
5.2.1 Differentiability and asymptotic normality
5.2.2 L-, M-, and R-estimators and rank statistics

5.3 Linear Functions of Order Statistics
5.3.1 Sample quantiles
5.3.2 Robustness and efficiency
5.3.3 L-estimators in linear models

5.4 Generalized Estimating Equations
5.4.1 The GEE method and its relationship with others
5.4.2 Consistency of GEE estimators
5.4.3 Asymptotic normality of GEE estimators

5.5 Variance Estimation
5.5.1 The substitution method
5.5.2 The jackknife
5.5.3 The bootstrap

5.6 Exercises


Chapter 6. Hypothesis Tests

6.1 UMP Tests
6.1.1 The Neyman-Pearson lemma
6.1.2 Monotone likelihood ratio
6.1.3 UMP tests for two-sided hypotheses

6.2 UMP Unbiased Tests
6.2.1 Unbiasedness, similarity, and Neyman structure
6.2.2 UMPU tests in exponential families
6.2.3 UMPU tests in normal families

6.3 UMP Invariant Tests
6.3.1 Invariance and UMPI tests
6.3.2 UMPI tests in normal linear models

6.4 Tests in Parametric Models
6.4.1 Likelihood ratio tests
6.4.2 Asymptotic tests based on likelihoods
6.4.3 $\chi^2$-tests
6.4.4 Bayes tests

6.5 Tests in Nonparametric Models
6.5.1 Sign, permutation, and rank tests
6.5.2 Kolmogorov-Smirnov and Cramér-von Mises tests
6.5.3 Empirical likelihood ratio tests
6.5.4 Asymptotic tests

6.6 Exercises


Chapter 7. Confidence Sets

7.1 Construction of Confidence Sets
7.1.1 Pivotal quantities
7.1.2 Inverting acceptance regions of tests
7.1.3 The Bayesian approach
7.1.4 Prediction sets

7.2 Properties of Confidence Sets
7.2.1 Lengths of confidence intervals
7.2.2 UMA and UMAU confidence sets
7.2.3 Randomized confidence sets
7.2.4 Invariant confidence sets

7.3 Asymptotic Confidence Sets
7.3.1 Asymptotically pivotal quantities
7.3.2 Confidence sets based on likelihoods
7.3.3 Confidence intervals for quantiles
7.3.4 Accuracy of asymptotic confidence sets

7.4 Bootstrap Confidence Sets
7.4.1 Construction of bootstrap confidence intervals
7.4.2 Asymptotic correctness and accuracy
7.4.3 High-order accurate bootstrap confidence sets

7.5 Simultaneous Confidence Intervals
7.5.1 Bonferroni’s method
7.5.2 Scheffé’s method in linear models
7.5.3 Tukey’s method in one-way ANOVA models
7.5.4 Confidence bands for c.d.f.’s

7.6 Exercises


References
List of Notation
List of Abbreviations
Index of Definitions, Main Results, and Examples
Author Index
Subject Index


Chapter 1 


Probability Theory 


Mathematical statistics relies on probability theory, which in turn is based 
on measure theory. The present chapter provides some principal concepts 
and notational conventions of probability theory, and some important re- 
sults that are useful tools in statistics. A more complete account of proba- 
bility theory can be found in a standard textbook, for example, Billingsley 
(1986), Chung (1974), or Loève (1977). The reader is assumed to be familiar
with set operations and set functions (mappings) in advanced calculus. 


1.1 Probability Spaces and Random Elements 


In an elementary probability course, one defines a random experiment to 
be an experiment whose outcome cannot be predicted with certainty, and 
the probability of A (a collection of possible outcomes) to be the fraction 
of times that the outcome of the random experiment results in A in a 
large number of trials of the random experiment. A rigorous and logically 
consistent definition of probability was given by A. N. Kolmogorov in his 
measure-theoretic fundamental development of probability theory in 1933 
(Kolmogorov, 1933). 


1.1.1 $\sigma$-fields and measures


Let $\Omega$ be a set of elements of interest. For example, $\Omega$ can be a set of
numbers, a subinterval of the real line, or all possible outcomes of a random
experiment. In probability theory, $\Omega$ is often called the outcome space,
whereas in statistical theory, $\Omega$ is called the sample space. This is because
in probability and statistics, $\Omega$ is usually the set of all possible outcomes of
a random experiment under study.




A measure is a natural mathematical extension of the length, area, or
volume of subsets in the one-, two-, or three-dimensional Euclidean space.
In a given sample space $\Omega$, a measure is a set function defined for certain
subsets of $\Omega$. It is necessary for this collection of subsets to satisfy certain
properties, which are given in the following definition.


Definition 1.1. Let $\mathcal{F}$ be a collection of subsets of a sample space $\Omega$. $\mathcal{F}$ is
called a $\sigma$-field (or $\sigma$-algebra) if and only if it has the following properties.
(i) The empty set $\emptyset \in \mathcal{F}$.
(ii) If $A \in \mathcal{F}$, then the complement $A^c \in \mathcal{F}$.
(iii) If $A_i \in \mathcal{F}$, $i = 1, 2, ...$, then their union $\cup A_i \in \mathcal{F}$. ∎


A pair $(\Omega, \mathcal{F})$ consisting of a set $\Omega$ and a $\sigma$-field $\mathcal{F}$ of subsets of $\Omega$ is
called a measurable space. The elements of $\mathcal{F}$ are called measurable sets in
measure theory or events in probability and statistics.


Since $\emptyset^c = \Omega$, it follows from (i) and (ii) in Definition 1.1 that $\Omega \in \mathcal{F}$
if $\mathcal{F}$ is a $\sigma$-field on $\Omega$. Also, it follows from (ii) and (iii) that if $A_i \in \mathcal{F}$,
$i = 1, 2, ...$, and $\mathcal{F}$ is a $\sigma$-field, then the intersection $\cap A_i \in \mathcal{F}$. This can be
shown using De Morgan’s law: $(\cap A_i)^c = \cup A_i^c$.
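
The De Morgan step can be written out in full. The following derivation is an added sketch, not part of the original text, and uses only properties (ii) and (iii) of Definition 1.1:

```latex
\begin{align*}
A_i \in \mathcal{F},\ i = 1, 2, \ldots
  &\;\Longrightarrow\; A_i^c \in \mathcal{F},\ i = 1, 2, \ldots
     && \text{by (ii)} \\
  &\;\Longrightarrow\; \bigcup_{i=1}^{\infty} A_i^c \in \mathcal{F}
     && \text{by (iii)} \\
  &\;\Longrightarrow\; \bigcap_{i=1}^{\infty} A_i
       = \Bigl( \bigcup_{i=1}^{\infty} A_i^c \Bigr)^c \in \mathcal{F}
     && \text{by De Morgan's law and (ii).}
\end{align*}
```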

For any given $\Omega$, there are two trivial $\sigma$-fields. The first one is the
collection containing exactly two elements, $\emptyset$ and $\Omega$. This is the smallest
possible $\sigma$-field on $\Omega$. The second one is the collection of all subsets of $\Omega$,
which is called the power set and is the largest $\sigma$-field on $\Omega$.


Let us now consider some nontrivial $\sigma$-fields. Let $A$ be a nonempty
proper subset of $\Omega$ ($A \subset \Omega$, $A \neq \Omega$). Then (verify)

$$\{\emptyset, A, A^c, \Omega\} \tag{1.1}$$

is a $\sigma$-field. In fact, this is the smallest $\sigma$-field containing $A$ in the sense that
if $\mathcal{F}$ is any $\sigma$-field containing $A$, then the $\sigma$-field in (1.1) is a subcollection
of $\mathcal{F}$. In general, the smallest $\sigma$-field containing $\mathcal{C}$, a collection of subsets of
$\Omega$, is denoted by $\sigma(\mathcal{C})$ and is called the $\sigma$-field generated by $\mathcal{C}$. Hence, the
$\sigma$-field in (1.1) is $\sigma(\{A\})$. Note that $\sigma(\{A, A^c\})$, $\sigma(\{A, \Omega\})$, and $\sigma(\{A, \emptyset\})$
are all the same as $\sigma(\{A\})$. Of course, if $\mathcal{C}$ itself is a $\sigma$-field, then $\sigma(\mathcal{C}) = \mathcal{C}$.
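
To make the notion of a generated $\sigma$-field concrete, the sketch below computes $\sigma(\mathcal{C})$ by brute force when $\Omega$ is finite. It is an added illustration rather than part of the original text: the function name `generated_sigma_field` is hypothetical, and the approach assumes a finite $\Omega$, where closure under complements and pairwise unions is enough because countable unions reduce to finite ones.

```python
from itertools import combinations


def generated_sigma_field(omega, collection):
    """Smallest sigma-field on the finite set `omega` containing each set in `collection`."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:
        changed = False
        current = list(field)
        # Close under complements (Definition 1.1(ii)).
        for a in current:
            complement = omega - a
            if complement not in field:
                field.add(complement)
                changed = True
        # Close under pairwise unions (enough for Definition 1.1(iii) when omega is finite).
        for a, b in combinations(current, 2):
            union = a | b
            if union not in field:
                field.add(union)
                changed = True
    return field


omega = {1, 2, 3, 4}
A = {1, 2}
sigma_A = generated_sigma_field(omega, [A])
```

Here `sigma_A` consists of the four sets in (1.1), and passing `[A, omega - A]` instead of `[A]` gives the same collection, in line with the remark that $\sigma(\{A, A^c\}) = \sigma(\{A\})$.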


On the real line $\mathcal{R}$, there is a special $\sigma$-field that will be used almost
exclusively. Let $\mathcal{C}$ be the collection of all finite open intervals on $\mathcal{R}$. Then
$\mathcal{B} = \sigma(\mathcal{C})$ is called the Borel $\sigma$-field. The elements of $\mathcal{B}$ are called Borel
sets. The Borel $\sigma$-field $\mathcal{B}^k$ on the $k$-dimensional Euclidean space $\mathcal{R}^k$ can be
similarly defined. It can be shown that all intervals (finite or infinite), open
sets, and closed sets are Borel sets. To illustrate, we now show that, on the
real line, $\mathcal{B} = \sigma(\mathcal{O})$, where $\mathcal{O}$ is the collection of all open sets. Typically,
one needs to show that $\sigma(\mathcal{C}) \subset \sigma(\mathcal{O})$ and $\sigma(\mathcal{O}) \subset \sigma(\mathcal{C})$. Since an open
interval is an open set, $\mathcal{C} \subset \mathcal{O}$ and, hence, $\sigma(\mathcal{C}) \subset \sigma(\mathcal{O})$ (why?). Let $U$ be
an open set. Then $U$ can be expressed as a union of a sequence of finite open
intervals (see Royden (1968, p.39)). Hence, $U \in \sigma(\mathcal{C})$ (Definition 1.1(iii))
and $\mathcal{O} \subset \sigma(\mathcal{C})$. By the definition of $\sigma(\mathcal{O})$, $\sigma(\mathcal{O}) \subset \sigma(\mathcal{C})$. This completes
the proof.

Let $C \subset \mathcal{R}^k$ be a Borel set and let $\mathcal{B}_C = \{C \cap B : B \in \mathcal{B}^k\}$. Then
$(C, \mathcal{B}_C)$ is a measurable space and $\mathcal{B}_C$ is called the Borel $\sigma$-field on $C$.


Now we can introduce the notion of a measure. 


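Definition 1.2. Let $(\Omega, \mathcal{F})$ be a measurable space. A set function $\nu$ defined
on $\mathcal{F}$ is called a measure if and only if it has the following properties.
(i) $0 \le \nu(A) \le \infty$ for any $A \in \mathcal{F}$.
(ii) $\nu(\emptyset) = 0$.
(iii) If $A_i \in \mathcal{F}$, $i = 1, 2, ...$, and $A_i$’s are disjoint, i.e., $A_i \cap A_j = \emptyset$ for any
$i \neq j$, then

$$\nu\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} \nu(A_i).$$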


The triple $(\Omega, \mathcal{F}, \nu)$ is called a measure space. If $\nu(\Omega) = 1$, then $\nu$ is
called a probability measure and we usually denote it by $P$ instead of $\nu$, in
which case $(\Omega, \mathcal{F}, P)$ is called a probability space.

Although measure is an extension of length, area, or volume, sometimes
it can be quite abstract. For example, the following set function is a
measure:

$$\nu(A) = \begin{cases} \infty & A \in \mathcal{F},\ A \neq \emptyset \\ 0 & A = \emptyset. \end{cases} \tag{1.2}$$


Since a measure can take $\infty$ as its value, we must know how to do arithmetic
with $\infty$. In this book, it suffices to know that (1) for any $x \in \mathcal{R}$, $\infty + x = \infty$,
$x\infty = \infty$ if $x > 0$, $x\infty = -\infty$ if $x < 0$, and $0\infty = 0$; (2) $\infty + \infty = \infty$; and
(3) $\infty^a = \infty$ for any $a > 0$. However, $\infty - \infty$ or $\infty/\infty$ is not defined.
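
These conventions mostly agree with floating-point arithmetic, with one exception worth noting. The sketch below is an added illustration (the helper name `ext_mul` is hypothetical and not from the text): IEEE floating point returns NaN for $0 \cdot \infty$, so the convention $0\infty = 0$ has to be coded explicitly, while $\infty - \infty$ and $\infty/\infty$ come out as NaN, matching "not defined".

```python
import math

INF = math.inf


def ext_mul(x, y):
    """Multiply on the extended real line using the convention 0 * inf = 0."""
    if x == 0 or y == 0:
        return 0.0
    return x * y


# Rules (1)-(3) as floating-point checks.
sum_with_finite = INF + 3.5          # inf + x = inf
positive_multiple = ext_mul(2.0, INF)   # x * inf = inf for x > 0
negative_multiple = ext_mul(-2.0, INF)  # x * inf = -inf for x < 0
zero_multiple = ext_mul(0.0, INF)       # 0 * inf = 0 by convention
power = INF ** 0.5                   # inf ** a = inf for a > 0
```

A plain `0.0 * math.inf` would give NaN rather than 0, which is why the helper is needed when integrating a function that is zero on a set of infinite measure.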

The following examples provide two very important measures in proba- 
bility and statistics. 


Example 1.1 (Counting measure). Let $\Omega$ be a sample space, $\mathcal{F}$ the collection
of all subsets, and $\nu(A)$ the number of elements in $A \in \mathcal{F}$ ($\nu(A) = \infty$
if $A$ contains infinitely many elements). Then $\nu$ is a measure on $\mathcal{F}$ and is
called the counting measure.
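
For finite sets, the counting measure is easy to check against Definition 1.2. The following sketch is an added illustration under that finiteness assumption; the names `counting_measure` and `pairwise_disjoint` are hypothetical and not from the text.

```python
def counting_measure(a):
    """nu(A) = number of elements of A; this sketch handles finite sets only."""
    return len(a)


def pairwise_disjoint(sets):
    """True if no element belongs to two of the given sets."""
    seen = set()
    for s in sets:
        if seen & s:
            return False
        seen |= s
    return True


blocks = [{1, 2}, {3}, {4, 5, 6}]      # pairwise disjoint sets
union_of_blocks = set().union(*blocks)

overlapping = [{1, 2}, {2, 3}]         # these share the element 2
union_of_overlapping = set().union(*overlapping)
```

For the disjoint `blocks`, the measure of the union equals the sum of the measures, as property (iii) requires. For the `overlapping` sets, the measure of the union is strictly smaller than the sum.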


Example 1.2 (Lebesgue measure). There is a unique measure $m$ on $(\mathcal{R}, \mathcal{B})$
that satisfies

$$m([a, b]) = b - a \tag{1.3}$$

for every finite interval $[a, b]$, $-\infty < a \le b < \infty$. This is called the Lebesgue
measure. If we restrict $m$ to the measurable space $([0, 1], \mathcal{B}_{[0,1]})$, then $m$ is
a probability measure. ∎
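
Property (1.3), together with additivity, determines the Lebesgue measure of any finite union of intervals. The sketch below is an added illustration: the function name `lebesgue_measure_of_union` is hypothetical, and it merges overlapping closed intervals before summing their lengths.

```python
def lebesgue_measure_of_union(intervals):
    """Lebesgue measure of a finite union of closed intervals [a, b], given as (a, b) pairs."""
    total = 0.0
    current_start = current_end = None
    for a, b in sorted(intervals):
        if b < a:
            raise ValueError("each interval needs a <= b")
        if current_end is None or a > current_end:
            # A gap: close the previous merged interval and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = a, b
        else:
            # Overlap or touching endpoints: extend the merged interval.
            current_end = max(current_end, b)
    if current_end is not None:
        total += current_end - current_start
    return total


overlapping_length = lebesgue_measure_of_union([(0, 1), (0.5, 2)])   # union is [0, 2]
disjoint_length = lebesgue_measure_of_union([(0, 1), (2, 3)])
unit_interval = lebesgue_measure_of_union([(0, 1)])
```

The value for the unit interval is 1, which is the reason $m$ restricted to $[0, 1]$ is a probability measure.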




If $\Omega$ is countable in the sense that there is a one-to-one correspondence
between $\Omega$ and the set of all integers, then one can usually consider the
trivial $\sigma$-field that contains all subsets of $\Omega$ and a measure that assigns a
value to every subset of $\Omega$. When $\Omega$ is uncountable (e.g., $\Omega = \mathcal{R}$ or $[0,1)$),
it is not possible to define a reasonable measure for every subset of $\Omega$; for
example, it is not possible to find a measure on all subsets of $\mathcal{R}$ and still
satisfy property (1.3). This is why it is necessary to introduce $\sigma$-fields that
are smaller than the power set.


The following result provides some basic properties of measures. Whenever
we consider $\nu(A)$, it is implicitly assumed that $A \in \mathcal{F}$.


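Proposition 1.1. Let $(\Omega, \mathcal{F}, \nu)$ be a measure space.
(i) (Monotonicity). If $A \subset B$, then $\nu(A) \le \nu(B)$.
(ii) (Subadditivity). For any sequence $A_1, A_2, ...$,

$$\nu\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) \le \sum_{i=1}^{\infty} \nu(A_i).$$

(iii) (Continuity). If $A_1 \subset A_2 \subset A_3 \subset \cdots$ (or $A_1 \supset A_2 \supset A_3 \supset \cdots$ and
$\nu(A_1) < \infty$), then

$$\nu\Bigl(\lim_{n \to \infty} A_n\Bigr) = \lim_{n \to \infty} \nu(A_n),$$

where

$$\lim_{n \to \infty} A_n = \bigcup_{i=1}^{\infty} A_i \quad \Bigl(\text{or } = \bigcap_{i=1}^{\infty} A_i\Bigr).$$

Proof. We prove (i) only. The proofs of (ii) and (iii) are left as exercises.
Since $A \subset B$, $B = A \cup (A^c \cap B)$ and $A$ and $A^c \cap B$ are disjoint. By
Definition 1.2(iii), $\nu(B) = \nu(A) + \nu(A^c \cap B)$, which is no smaller than $\nu(A)$
since $\nu(A^c \cap B) \ge 0$ by Definition 1.2(i). ∎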
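
The finiteness condition $\nu(A_1) < \infty$ in part (iii) cannot be dropped for decreasing sequences. The worked example below is an added illustration, not taken from the original text, using the Lebesgue measure $m$ of Example 1.2:

```latex
\begin{align*}
A_n &= [n, \infty), \quad n = 1, 2, \ldots,
  \qquad A_1 \supset A_2 \supset A_3 \supset \cdots, \\
m(A_n) &= \infty \ \text{ for every } n,
  \quad\text{so}\quad \lim_{n \to \infty} m(A_n) = \infty, \\
\lim_{n \to \infty} A_n &= \bigcap_{i=1}^{\infty} A_i = \emptyset,
  \quad\text{so}\quad m\Bigl(\lim_{n \to \infty} A_n\Bigr) = m(\emptyset) = 0 \neq \infty .
\end{align*}
```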


There is a one-to-one correspondence between the set of all probability
measures on $(\mathcal{R}, \mathcal{B})$ and a set of functions on $\mathcal{R}$. Let $P$ be a probability
measure. The cumulative distribution function (c.d.f.) of $P$ is defined to be

$$F(x) = P((-\infty, x]), \quad x \in \mathcal{R}. \tag{1.4}$$

Proposition 1.2. (i) Let $F$ be a c.d.f. on $\mathcal{R}$. Then
(a) $F(-\infty) = \lim_{x \to -\infty} F(x) = 0$;
(b) $F(\infty) = \lim_{x \to \infty} F(x) = 1$;
(c) $F$ is nondecreasing, i.e., $F(x) \le F(y)$ if $x \le y$;
(d) $F$ is right continuous, i.e., $\lim_{y \to x, y > x} F(y) = F(x)$.
(ii) Suppose that a real-valued function $F$ on $\mathcal{R}$ satisfies (a)-(d) in part (i).
Then $F$ is the c.d.f. of a unique probability measure on $(\mathcal{R}, \mathcal{B})$. ∎
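
Definition (1.4) and properties (a)-(d) can be checked directly for a probability measure that puts mass on finitely many points. The sketch below is an added illustration; the helper `make_cdf` is hypothetical, and it assumes distinct points with probabilities summing to 1.

```python
from bisect import bisect_right
from itertools import accumulate


def make_cdf(points, probs):
    """Return F(x) = P((-inf, x]) for a measure with mass probs[i] at points[i]."""
    pairs = sorted(zip(points, probs))
    xs = [x for x, _ in pairs]
    cumulative = list(accumulate(p for _, p in pairs))

    def F(x):
        k = bisect_right(xs, x)  # number of mass points that are <= x
        return cumulative[k - 1] if k > 0 else 0.0

    return F


# Mass 1/4, 1/2, 1/4 at the points 0, 1, 2.
F = make_cdf([0, 1, 2], [0.25, 0.5, 0.25])
```

Because the interval in (1.4) is closed on the right, `F` jumps at each mass point and takes the upper value there. This is the right continuity in (d); the left limit at a mass point is strictly smaller.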




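The Cartesian product of sets (or collections of sets) $\Gamma_i$, $i \in \mathcal{I} = \{1, ..., k\}$
(or $\{1, 2, ...\}$), is defined as the set of all $(a_1, ..., a_k)$ (or $(a_1, a_2, ...)$), $a_i \in \Gamma_i$,
$i \in \mathcal{I}$, and is denoted by $\prod_{i \in \mathcal{I}} \Gamma_i = \Gamma_1 \times \cdots \times \Gamma_k$ (or $\Gamma_1 \times \Gamma_2 \times \cdots$). Let
$(\Omega_i, \mathcal{F}_i)$, $i \in \mathcal{I}$, be measurable spaces. Since $\prod_{i \in \mathcal{I}} \mathcal{F}_i$ is not necessarily a
$\sigma$-field, $\sigma\bigl(\prod_{i \in \mathcal{I}} \mathcal{F}_i\bigr)$ is called the product $\sigma$-field on the product space $\prod_{i \in \mathcal{I}} \Omega_i$
and $\bigl(\prod_{i \in \mathcal{I}} \Omega_i, \sigma\bigl(\prod_{i \in \mathcal{I}} \mathcal{F}_i\bigr)\bigr)$ is denoted by $\prod_{i \in \mathcal{I}} (\Omega_i, \mathcal{F}_i)$. As an example,
consider $(\Omega_i, \mathcal{F}_i) = (\mathcal{R}, \mathcal{B})$, $i = 1, ..., k$. Then the product space is $\mathcal{R}^k$ and
it can be shown that the product $\sigma$-field is the same as the Borel $\sigma$-field on
$\mathcal{R}^k$, which is the $\sigma$-field generated by the collection of all open sets in $\mathcal{R}^k$.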


In Example 1.2, the usual length of an interval $[a, b] \subset \mathcal{R}$ is the same as
the Lebesgue measure of $[a, b]$. Consider a rectangle $[a_1, b_1] \times [a_2, b_2] \subset \mathcal{R}^2$.
The usual area of $[a_1, b_1] \times [a_2, b_2]$ is

$$(b_1 - a_1)(b_2 - a_2) = m([a_1, b_1])\, m([a_2, b_2]), \tag{1.5}$$

i.e., the product of the Lebesgue measures of two intervals $[a_1, b_1]$ and
$[a_2, b_2]$. Note that $[a_1, b_1] \times [a_2, b_2]$ is a measurable set by the definition
of the product $\sigma$-field. Is $m([a_1, b_1])\, m([a_2, b_2])$ the same as the value of a
measure defined on the product $\sigma$-field? The following result answers this
question for any product space generated by a finite number of measurable
spaces. (Its proof can be found in Billingsley (1986, pp. 235-236).) Before
introducing this result, we need the following technical definition. A
measure $\nu$ on $(\Omega, \mathcal{F})$ is said to be $\sigma$-finite if and only if there exists a
sequence $\{A_1, A_2, ...\}$ such that $\cup A_i = \Omega$ and $\nu(A_i) < \infty$ for all $i$. Any finite
measure (such as a probability measure) is clearly $\sigma$-finite. The Lebesgue
measure in Example 1.2 is $\sigma$-finite, since $\mathcal{R} = \cup A_n$ with $A_n = (-n, n)$,
$n = 1, 2, ....$ The counting measure in Example 1.1 is $\sigma$-finite if and only if
$\Omega$ is countable. The measure defined by (1.2), however, is not $\sigma$-finite.


Proposition 1.3 (Product measure theorem). Let $(\Omega_i, \mathcal{F}_i, \nu_i)$, $i = 1, ..., k$,
be measure spaces with $\sigma$-finite measures, where $k \ge 2$ is an integer. Then
there exists a unique $\sigma$-finite measure on the product $\sigma$-field $\sigma(\mathcal{F}_1 \times \cdots \times \mathcal{F}_k)$,
called the product measure and denoted by $\nu_1 \times \cdots \times \nu_k$, such that

$$\nu_1 \times \cdots \times \nu_k (A_1 \times \cdots \times A_k) = \nu_1(A_1) \cdots \nu_k(A_k)$$

for all $A_i \in \mathcal{F}_i$, $i = 1, ..., k$. ∎


In $\mathcal{R}^2$, there is a unique measure, the product measure $m \times m$, for which
$m \times m([a_1, b_1] \times [a_2, b_2])$ is equal to the value given by (1.5). This measure
is called the Lebesgue measure on $(\mathcal{R}^2, \mathcal{B}^2)$. The Lebesgue measure on
$(\mathcal{R}^3, \mathcal{B}^3)$ is $m \times m \times m$, which equals the usual volume for a subset of the
form $[a_1, b_1] \times [a_2, b_2] \times [a_3, b_3]$. The Lebesgue measure on $(\mathcal{R}^k, \mathcal{B}^k)$ for any
positive integer $k$ is similarly defined.
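
On boxes, the product formula in Proposition 1.3 reduces to multiplying side lengths. The sketch below is an added illustration (the function name `box_measure` is hypothetical) that evaluates the Lebesgue measure of a box in $\mathcal{R}^k$ and checks it against a split of the same box into two pieces.

```python
from math import prod


def box_measure(box):
    """Lebesgue measure of [a_1, b_1] x ... x [a_k, b_k], given as a list of (a_i, b_i)."""
    return prod(b - a for a, b in box)


rectangle = [(0, 2), (1, 4)]                 # area (2 - 0) * (4 - 1)
left_half = [(0, 1), (1, 4)]
right_half = [(1, 2), (1, 4)]
cuboid = [(0, 1), (0, 2), (0, 3)]            # volume 1 * 2 * 3
```

The two halves overlap only on a line segment, which has Lebesgue measure zero in $\mathcal{R}^2$, so their measures add up to the measure of the whole rectangle.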


The concept of c.d.f. can be extended to $\mathcal{R}^k$. Let $P$ be a probability measure on $(\mathcal{R}^k,\mathcal{B}^k)$. The c.d.f. (or joint c.d.f.) of $P$ is defined by

$$F(x_1,\ldots,x_k) = P\big((-\infty,x_1] \times \cdots \times (-\infty,x_k]\big), \qquad x_i \in \mathcal{R}. \tag{1.6}$$


Again, there is a one-to-one correspondence between probability measures and joint c.d.f.'s on $\mathcal{R}^k$. Some properties of a joint c.d.f. are given in Exercise 10 in §1.6. If $F(x_1,\ldots,x_k)$ is a joint c.d.f., then

$$F_i(x) = \lim_{x_j \to \infty,\ j = 1,\ldots,i-1,i+1,\ldots,k} F(x_1,\ldots,x_{i-1},x,x_{i+1},\ldots,x_k)$$

is a c.d.f. and is called the $i$th marginal c.d.f. Clearly, marginal c.d.f.'s are determined by their joint c.d.f. But a joint c.d.f. cannot, in general, be determined by its $k$ marginal c.d.f.'s. There is one special but important case in which a joint c.d.f. $F$ is determined by its $k$ marginal c.d.f.'s $F_i$ through

$$F(x_1,\ldots,x_k) = F_1(x_1) \cdots F_k(x_k), \qquad (x_1,\ldots,x_k) \in \mathcal{R}^k, \tag{1.7}$$

in which case the probability measure corresponding to $F$ is the product measure $P_1 \times \cdots \times P_k$ with $P_i$ being the probability measure corresponding to $F_i$.

Proposition 1.3 can be extended to cases involving infinitely many measure spaces (Billingsley, 1986). In particular, if $(\mathcal{R}^k,\mathcal{B}^k,P_i)$, $i = 1,2,\ldots$, are probability spaces, then there is a product probability measure $P$ on $\prod_{i=1}^\infty (\mathcal{R}^k,\mathcal{B}^k)$ such that for any positive integer $l$ and $B_i \in \mathcal{B}^k$, $i = 1,\ldots,l$,

$$P(B_1 \times \cdots \times B_l \times \mathcal{R}^k \times \mathcal{R}^k \times \cdots) = P_1(B_1) \cdots P_l(B_l).$$


1.1.2 Measurable functions and distributions 


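Since $\Omega$ can be quite arbitrary, it is often convenient to consider a function (mapping) $f$ from $\Omega$ to a simpler space $\Lambda$ (often $\Lambda = \mathcal{R}^k$). Let $B \subset \Lambda$. Then the inverse image of $B$ under $f$ is

$$f^{-1}(B) = \{f \in B\} = \{\omega \in \Omega : f(\omega) \in B\}.$$

The inverse function $f^{-1}$ need not exist for $f^{-1}(B)$ to be defined. The reader is asked to verify the following properties:

(a) $f^{-1}(B^c) = (f^{-1}(B))^c$ for any $B \subset \Lambda$;

(b) $f^{-1}(\cup B_i) = \cup f^{-1}(B_i)$ for any $B_i \subset \Lambda$, $i = 1,2,\ldots$.

Let $\mathcal{C}$ be a collection of subsets of $\Lambda$. We define

$$f^{-1}(\mathcal{C}) = \{f^{-1}(C) : C \in \mathcal{C}\}.$$

Definition 1.3. Let $(\Omega,\mathcal{F})$ and $(\Lambda,\mathcal{G})$ be measurable spaces and $f$ a function from $\Omega$ to $\Lambda$. The function $f$ is called a measurable function from $(\Omega,\mathcal{F})$ to $(\Lambda,\mathcal{G})$ if and only if $f^{-1}(\mathcal{G}) \subset \mathcal{F}$. ∎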
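Properties (a) and (b) of inverse images can be checked directly on a small finite example. The sketch below assumes a six-point sample space and a non-invertible map to a two-point space; `Omega`, `Lambda`, `f`, and `preimage` are illustrative names.

```python
Omega = {1, 2, 3, 4, 5, 6}
Lambda = {"even", "odd"}

def f(w):
    """A many-to-one map from Omega to Lambda (no inverse function exists)."""
    return "even" if w % 2 == 0 else "odd"

def preimage(B):
    """f^{-1}(B) = {w in Omega : f(w) in B}."""
    return {w for w in Omega if f(w) in B}

B1, B2 = {"even"}, {"odd"}

# (a) The inverse image of a complement is the complement of the inverse image.
prop_a = preimage(Lambda - B1) == Omega - preimage(B1)

# (b) The inverse image of a union is the union of the inverse images.
prop_b = preimage(B1 | B2) == preimage(B1) | preimage(B2)
```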




If $\Lambda = \mathcal{R}$ and $\mathcal{G} = \mathcal{B}$ (Borel $\sigma$-field), then $f$ is said to be Borel measurable or is called a Borel function on $(\Omega,\mathcal{F})$ (or with respect to $\mathcal{F}$).

In probability theory, a measurable function is called a random element and denoted by one of $X, Y, Z, \ldots$. If $X$ is measurable from $(\Omega,\mathcal{F})$ to $(\mathcal{R},\mathcal{B})$, then it is called a random variable; if $X$ is measurable from $(\Omega,\mathcal{F})$ to $(\mathcal{R}^k,\mathcal{B}^k)$, then it is called a random $k$-vector. If $X_1,\ldots,X_k$ are random variables defined on a common probability space, then the vector $(X_1,\ldots,X_k)$ is a random $k$-vector. (As a notational convention, any vector $c \in \mathcal{R}^k$ is denoted by $(c_1,\ldots,c_k)$, where $c_i$ is the $i$th component of $c$.)

If $f$ is measurable from $(\Omega,\mathcal{F})$ to $(\Lambda,\mathcal{G})$, then $f^{-1}(\mathcal{G})$ is a sub-$\sigma$-field of $\mathcal{F}$ (verify). It is called the $\sigma$-field generated by $f$ and is denoted by $\sigma(f)$.

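Now we consider some examples of measurable functions. If $\mathcal{F}$ is the collection of all subsets of $\Omega$, then any function $f$ is measurable. Let $A \subset \Omega$. The indicator function for $A$ is defined as

$$I_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A. \end{cases}$$

For any $B \subset \mathcal{R}$,

$$I_A^{-1}(B) = \begin{cases} \emptyset & 0 \notin B,\ 1 \notin B \\ A & 0 \notin B,\ 1 \in B \\ A^c & 0 \in B,\ 1 \notin B \\ \Omega & 0 \in B,\ 1 \in B. \end{cases}$$

Then $\sigma(I_A)$ is the $\sigma$-field given in (1.1). If $A$ is a measurable set, then $I_A$ is a Borel function.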


Note that $\sigma(I_A)$ is a much smaller $\sigma$-field than the original $\sigma$-field $\mathcal{F}$. This is another reason why we introduce the concept of measurable functions and random variables, in addition to the reason that it is easy to deal with numbers. Often the $\sigma$-field $\mathcal{F}$ (such as the power set) contains too many subsets and we are only interested in some of them. One can then define a random variable $X$ with $\sigma(X)$ containing subsets that are of interest. In general, $\sigma(X)$ is between the trivial $\sigma$-field $\{\emptyset,\Omega\}$ and $\mathcal{F}$, and contains more subsets if $X$ is more complicated. For the simplest function $I_A$, we have shown that $\sigma(I_A)$ contains only four elements.
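The four-element structure of the $\sigma$-field generated by an indicator can be seen by enumerating inverse images on a finite space. The sketch below assumes a six-point sample space and a particular set `A`; all names are illustrative.

```python
Omega = frozenset(range(1, 7))
A = frozenset({1, 2})

def indicator(w):
    """I_A(w) = 1 if w is in A and 0 otherwise."""
    return 1 if w in A else 0

def preimage(B):
    """Inverse image of B under the indicator of A."""
    return frozenset(w for w in Omega if indicator(w) in B)

# Only whether 0 and/or 1 belongs to B matters, so four choices of B
# exhaust all possible inverse images.
sigma_I_A = {preimage(B) for B in [set(), {1}, {0}, {0, 1}]}
```

Any other Borel set `B` produces one of these four sets, depending only on which of 0 and 1 it contains.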


The class of simple functions is obtained by taking linear combinations of indicators of measurable sets, i.e.,

$$\varphi(\omega) = \sum_{i=1}^k a_i I_{A_i}(\omega), \tag{1.8}$$

where $A_1,\ldots,A_k$ are measurable sets on $\Omega$ and $a_1,\ldots,a_k$ are real numbers. One can show directly that such a function is a Borel function, but it follows immediately from Proposition 1.4. Let $A_1,\ldots,A_k$ be a partition of $\Omega$, i.e., the $A_i$'s are disjoint and $A_1 \cup \cdots \cup A_k = \Omega$. Then the simple function $\varphi$ given by (1.8) with distinct $a_i$'s exactly characterizes this partition and $\sigma(\varphi) = \sigma(\{A_1,\ldots,A_k\})$.


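Proposition 1.4. Let $(\Omega,\mathcal{F})$ be a measurable space.

(i) $f$ is Borel if and only if $f^{-1}(a,\infty) \in \mathcal{F}$ for all $a \in \mathcal{R}$.

(ii) If $f$ and $g$ are Borel, then so are $fg$ and $af + bg$, where $a$ and $b$ are real numbers; also, $f/g$ is Borel provided $g(\omega) \ne 0$ for any $\omega \in \Omega$.

(iii) If $f_1, f_2, \ldots$ are Borel, then so are $\sup_n f_n$, $\inf_n f_n$, $\limsup_n f_n$, and $\liminf_n f_n$. Furthermore, the set

$$A = \Big\{\omega \in \Omega : \lim_{n\to\infty} f_n(\omega) \text{ exists}\Big\}$$

is an event and the function

$$h(\omega) = \begin{cases} \lim_{n\to\infty} f_n(\omega) & \omega \in A \\ f_1(\omega) & \omega \notin A \end{cases}$$

is Borel.

(iv) Suppose that $f$ is measurable from $(\Omega,\mathcal{F})$ to $(\Lambda,\mathcal{G})$ and $g$ is measurable from $(\Lambda,\mathcal{G})$ to $(\Delta,\mathcal{H})$. Then the composite function $g \circ f$ is measurable from $(\Omega,\mathcal{F})$ to $(\Delta,\mathcal{H})$.

(v) Let $\Omega$ be a Borel set in $\mathcal{R}^p$. If $f$ is a continuous function from $\Omega$ to $\mathcal{R}^q$, then $f$ is measurable. ∎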


Proposition 1.4 indicates that there are many Borel functions. In fact, 
it is hard to find a non-Borel function. 


The following result is very useful in technical proofs. Let $f$ be a nonnegative Borel function on $(\Omega,\mathcal{F})$. Then there exists a sequence of simple functions $\{\varphi_n\}$ satisfying $0 \le \varphi_1 \le \varphi_2 \le \cdots \le f$ and $\lim_{n\to\infty} \varphi_n = f$ (Exercise 17 in §1.6).
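One standard way to build such a sequence (an assumption here, since the text leaves the construction to the exercise) is the dyadic truncation $\varphi_n = \min\{\lfloor 2^n f \rfloor / 2^n,\ n\}$. The sketch below evaluates it for an example function at a few points; `phi`, `f`, and `xs` are illustrative names.

```python
import math

def phi(n, value):
    """n-th dyadic approximation applied to a value f(w) >= 0:
    min(floor(2^n f) / 2^n, n). It takes finitely many values, so
    w -> phi(n, f(w)) is a simple function whenever f is Borel."""
    return min(math.floor(value * 2**n) / 2**n, n)

def f(x):
    """An example nonnegative Borel function on the real line."""
    return x * x

xs = [0.0, 0.3, 1.0, 1.7, 2.5]

# Row n - 1 holds the values of phi_n(f(x)) at the points in xs, n = 1, ..., 11.
approx = [[phi(n, f(x)) for x in xs] for n in range(1, 12)]
```

Each column of `approx` is nondecreasing in $n$ and stays below $f(x)$; once $n$ exceeds $f(x)$, the gap is at most $2^{-n}$.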

Let $(\Omega,\mathcal{F},\nu)$ be a measure space and $f$ be a measurable function from $(\Omega,\mathcal{F})$ to $(\Lambda,\mathcal{G})$. The induced measure by $f$, denoted by $\nu \circ f^{-1}$, is a measure on $\mathcal{G}$ defined as

$$\nu \circ f^{-1}(B) = \nu(f \in B) = \nu(f^{-1}(B)), \qquad B \in \mathcal{G}. \tag{1.9}$$

It is usually easier to deal with $\nu \circ f^{-1}$ than to deal with $\nu$ since $(\Lambda,\mathcal{G})$ is usually simpler than $(\Omega,\mathcal{F})$. Furthermore, subsets not in $\sigma(f)$ are not involved in the definition of $\nu \circ f^{-1}$. As we discussed earlier, in some cases we are only interested in subsets in $\sigma(f)$.


If $\nu = P$ is a probability measure and $X$ is a random variable or a random vector, then $P \circ X^{-1}$ is called the law or the distribution of $X$ and is denoted by $P_X$. The c.d.f. of $P_X$ defined by (1.4) or (1.6) is also called the c.d.f. or joint c.d.f. of $X$ and is denoted by $F_X$. On the other hand, for any c.d.f. or joint c.d.f. $F$, there exists at least one random variable or vector (usually there are many) defined on some probability space for which $F_X = F$. The following are some examples of random variables and their c.d.f.'s. More examples can be found in §1.3.1.


Example 1.3 (Discrete c.d.f.'s). Let $a_1 < a_2 < \cdots$ be a sequence of real numbers and let $p_n$, $n = 1,2,\ldots$, be a sequence of positive numbers such that $\sum_{n=1}^\infty p_n = 1$. Define

$$F(x) = \begin{cases} \sum_{i=1}^n p_i & a_n \le x < a_{n+1},\ n = 1,2,\ldots \\ 0 & -\infty < x < a_1. \end{cases} \tag{1.10}$$

Then $F$ is a stepwise c.d.f. It has a jump of size $p_n$ at each $a_n$ and is flat between $a_n$ and $a_{n+1}$, $n = 1,2,\ldots$. Such a c.d.f. is called a discrete c.d.f. and the corresponding random variable is called a discrete random variable. We can easily obtain a random variable having $F$ in (1.10) as its c.d.f. For example, let $\Omega = \{a_1, a_2, \ldots\}$, $\mathcal{F}$ be the collection of all subsets of $\Omega$,

$$P(A) = \sum_{i:\, a_i \in A} p_i, \qquad A \in \mathcal{F}, \tag{1.11}$$

and $X(\omega) = \omega$. One can show that $P$ is a probability measure and the c.d.f. of $X$ is $F$ in (1.10). ∎
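A discrete c.d.f. of the form (1.10) and its probability measure (1.11) are easy to evaluate when only finitely many jump points are present. The sketch below assumes three jump points with illustrative masses; the names `a`, `p`, `F`, and `P` mirror the notation of the example but the values are not from the text.

```python
from bisect import bisect_right
from fractions import Fraction

a = [1, 2, 4]                                         # a_1 < a_2 < a_3 (illustrative)
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]  # positive masses summing to 1

def F(x):
    """F(x) = sum of p_i over all a_i <= x, as in (1.10)."""
    n = bisect_right(a, x)          # number of jump points a_i with a_i <= x
    return sum(p[:n], Fraction(0))

def P(A):
    """P(A) = sum of p_i over all a_i in A, as in (1.11)."""
    return sum((pi for ai, pi in zip(a, p) if ai in A), Fraction(0))
```

The jump of `F` at a point `a[i]` equals `p[i]`, and `F` is constant between consecutive jump points.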


Example 1.4 (Continuous c.d.f.'s). Opposite to the class of discrete c.d.f.'s is the class of continuous c.d.f.'s. Without the concepts of integration and differentiation introduced in the next section, we can only provide a few examples of continuous c.d.f.'s. One such example is the uniform c.d.f. on the interval $[a,b]$ defined as

$$F(x) = \begin{cases} 0 & -\infty < x < a \\ \frac{x-a}{b-a} & a \le x < b \\ 1 & b \le x < \infty. \end{cases}$$

Another example is the exponential c.d.f. defined as

$$F(x) = \begin{cases} 0 & -\infty < x < 0 \\ 1 - e^{-x/\theta} & 0 \le x < \infty, \end{cases}$$

where $\theta$ is a fixed positive constant. Note that both uniform and exponential c.d.f.'s are continuous functions. ∎




1.2 Integration and Differentiation 


Differentiation and integration are two of the main components of calculus. 
This is also true in measure theory or probability theory, except that inte- 
gration is introduced first whereas in calculus, differentiation is introduced 
first. 


1.2.1 Integration 


An important concept needed in probability and statistics is the integration of Borel functions with respect to (w.r.t.) a measure $\nu$, which is a type of "average". The definition proceeds in several steps. First, we define the integral of a nonnegative simple function, i.e., a simple function $\varphi$ given by (1.8) with $a_i \ge 0$, $i = 1,\ldots,k$.

Definition 1.4(a). The integral of a nonnegative simple function $\varphi$ given by (1.8) w.r.t. $\nu$ is defined as

$$\int \varphi\, d\nu = \sum_{i=1}^k a_i \nu(A_i). \tag{1.12}$$ ∎


The right-hand side of (1.12) is a weighted average of the $a_i$'s with the $\nu(A_i)$'s as weights. Since $a\infty = \infty$ if $a > 0$ and $a\infty = 0$ if $a = 0$, the right-hand side of (1.12) is always well defined, although $\int \varphi\, d\nu = \infty$ is possible. Note that different $a_i$'s and $A_i$'s may produce the same function $\varphi$; for example, with $\Omega = \mathcal{R}$,

$$2I_{(0,1]}(x) + I_{(1,2]}(x) = I_{(0,2]}(x) + I_{(0,1]}(x).$$

However, one can show that different representations of $\varphi$ in (1.8) produce the same value for $\int \varphi\, d\nu$ so that the integral of a nonnegative simple function is well defined.
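The two representations of the same simple function can be integrated w.r.t. Lebesgue measure using (1.12) directly. The sketch below assumes each set is a half-open interval `(lo, hi]` encoded as a pair; `length`, `integral_simple`, `rep1`, and `rep2` are illustrative names.

```python
from fractions import Fraction

def length(interval):
    """Lebesgue measure of a half-open interval (lo, hi], i.e. hi - lo."""
    lo, hi = interval
    return Fraction(hi) - Fraction(lo)

def integral_simple(representation):
    """Sum of a_i * m(A_i) over a list of (a_i, A_i) pairs, as in (1.12)."""
    return sum((Fraction(a) * length(A) for a, A in representation), Fraction(0))

rep1 = [(2, (0, 1)), (1, (1, 2))]   # 2 I_(0,1] + I_(1,2]
rep2 = [(1, (0, 2)), (1, (0, 1))]   # I_(0,2] + I_(0,1]
```

Both representations give the value 3, in line with the claim that the integral does not depend on the chosen representation.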


Next, we consider a nonnegative Borel function $f$.

Definition 1.4(b). Let $f$ be a nonnegative Borel function and let $\mathcal{S}_f$ be the collection of all nonnegative simple functions of the form (1.8) satisfying $\varphi(\omega) \le f(\omega)$ for any $\omega \in \Omega$. The integral of $f$ w.r.t. $\nu$ is defined as

$$\int f\, d\nu = \sup\Big\{ \int \varphi\, d\nu : \varphi \in \mathcal{S}_f \Big\}.$$ ∎

Hence, for any Borel function $f \ge 0$, there exists a sequence of simple functions $\varphi_1, \varphi_2, \ldots$ such that $0 \le \varphi_i \le f$ for all $i$ and $\lim_{n\to\infty} \int \varphi_n\, d\nu = \int f\, d\nu$.




Finally, for a Borel function $f$, we first define the positive part of $f$ by

$$f_+(\omega) = \max\{f(\omega), 0\}$$

and the negative part of $f$ by

$$f_-(\omega) = \max\{-f(\omega), 0\}.$$

Note that $f_+$ and $f_-$ are nonnegative Borel functions, $f(\omega) = f_+(\omega) - f_-(\omega)$, and $|f(\omega)| = f_+(\omega) + f_-(\omega)$.


Definition 1.4(c). Let $f$ be a Borel function. We say that $\int f\, d\nu$ exists if and only if at least one of $\int f_+\, d\nu$ and $\int f_-\, d\nu$ is finite, in which case

$$\int f\, d\nu = \int f_+\, d\nu - \int f_-\, d\nu. \tag{1.13}$$

When both $\int f_+\, d\nu$ and $\int f_-\, d\nu$ are finite, we say that $f$ is integrable. Let $A$ be a measurable set and $I_A$ be its indicator function. The integral of $f$ over $A$ is defined as

$$\int_A f\, d\nu = \int I_A f\, d\nu.$$ ∎

Note that a Borel function $f$ is integrable if and only if $|f|$ is integrable.

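It is convenient to define the integral of a measurable function $f$ from $(\Omega,\mathcal{F},\nu)$ to $(\bar{\mathcal{R}},\bar{\mathcal{B}})$, where $\bar{\mathcal{R}} = \mathcal{R} \cup \{-\infty,\infty\}$ and $\bar{\mathcal{B}} = \sigma(\mathcal{B} \cup \{\{\infty\},\{-\infty\}\})$. Let $A_+ = \{f = \infty\}$ and $A_- = \{f = -\infty\}$. If $\nu(A_+) = 0$, we define $\int f_+\, d\nu$ to be $\int I_{A_+^c} f_+\, d\nu$; otherwise $\int f_+\, d\nu = \infty$. $\int f_-\, d\nu$ is similarly defined. If at least one of $\int f_+\, d\nu$ and $\int f_-\, d\nu$ is finite, then $\int f\, d\nu$ is defined by (1.13).

The integral of $f$ may be denoted differently whenever there is a need to indicate the variable(s) to be integrated and the integration domain; for example, $\int_\Omega f\, d\nu$, $\int f(\omega)\, d\nu$, $\int f(\omega)\, d\nu(\omega)$, or $\int f(\omega)\, \nu(d\omega)$, and so on. In probability and statistics, $\int X\, dP$ is usually written as $EX$ or $E(X)$ and called the expectation or expected value of $X$. If $F$ is the c.d.f. of $P$ on $(\mathcal{R}^k,\mathcal{B}^k)$, $\int f(x)\, dP$ is also denoted by $\int f(x)\, dF(x)$ or $\int f\, dF$.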


Example 1.5. Let $\Omega$ be a countable set, $\mathcal{F}$ be all subsets of $\Omega$, and $\nu$ be the counting measure given in Example 1.1. For any Borel function $f$, it can be shown (exercise) that

$$\int f\, d\nu = \sum_{\omega \in \Omega} f(\omega). \tag{1.14}$$ ∎


Example 1.6. If $\Omega = \mathcal{R}$ and $\nu$ is the Lebesgue measure, then the Lebesgue integral of $f$ over an interval $[a,b]$ is written as $\int_{[a,b]} f(x)\, dx = \int_a^b f(x)\, dx$, which agrees with the Riemann integral in calculus when the latter is well defined. However, there are functions for which the Lebesgue integrals are defined but not the Riemann integrals. ∎


We now introduce some properties of integrals. The proof of the follow- 
ing result is left to the reader. 


Proposition 1.5 (Linearity of integrals). Let $(\Omega,\mathcal{F},\nu)$ be a measure space and $f$ and $g$ be Borel functions.

(i) If $\int f\, d\nu$ exists and $a \in \mathcal{R}$, then $\int (af)\, d\nu$ exists and is equal to $a \int f\, d\nu$.

(ii) If both $\int f\, d\nu$ and $\int g\, d\nu$ exist and $\int f\, d\nu + \int g\, d\nu$ is well defined, then $\int (f+g)\, d\nu$ exists and is equal to $\int f\, d\nu + \int g\, d\nu$. ∎


If $N$ is an event with $\nu(N) = 0$ and a statement holds for all $\omega$ in the complement $N^c$, then the statement is said to hold a.e. (almost everywhere) $\nu$ (or simply a.e. if the measure $\nu$ is clear from the context). If $\nu$ is a probability measure, then a.e. may be replaced by a.s. (almost surely).


Proposition 1.6. Let $(\Omega,\mathcal{F},\nu)$ be a measure space and $f$ and $g$ be Borel.

(i) If $f \le g$ a.e., then $\int f\, d\nu \le \int g\, d\nu$, provided that the integrals exist.

(ii) If $f \ge 0$ a.e. and $\int f\, d\nu = 0$, then $f = 0$ a.e.

Proof. (i) The proof for part (i) is left to the reader.

(ii) Let $A = \{f > 0\}$ and $A_n = \{f \ge n^{-1}\}$, $n = 1,2,\ldots$. Then $A_n \subset A$ for any $n$ and $\lim_{n\to\infty} A_n = \cup A_n = A$ (why?). By Proposition 1.1(iii), $\lim_{n\to\infty} \nu(A_n) = \nu(A)$. Using part (i) and Proposition 1.5, we obtain that

$$n^{-1}\nu(A_n) = \int n^{-1} I_{A_n}\, d\nu \le \int f I_{A_n}\, d\nu \le \int f\, d\nu = 0$$

for any $n$. Hence $\nu(A) = 0$ and $f = 0$ a.e. ∎


Some direct consequences of Proposition 1.6(i) are: $|\int f\, d\nu| \le \int |f|\, d\nu$; if $f \ge 0$ a.e., then $\int f\, d\nu \ge 0$; and if $f = g$ a.e., then $\int f\, d\nu = \int g\, d\nu$.

It is sometimes required to know whether the following interchange of two operations is valid:

$$\int \lim_{n\to\infty} f_n\, d\nu = \lim_{n\to\infty} \int f_n\, d\nu, \tag{1.15}$$

where $\{f_n : n = 1,2,\ldots\}$ is a sequence of Borel functions. Note that we only require that $\lim_{n\to\infty} f_n$ exists a.e. Also, $\lim_{n\to\infty} f_n$ is Borel (Proposition 1.4). The following example shows that (1.15) is not always true.


Example 1.7. Consider $(\mathcal{R},\mathcal{B})$ and the Lebesgue measure. Define $f_n(x) = n I_{[0,n^{-1}]}(x)$, $n = 1,2,\ldots$. Then $\lim_{n\to\infty} f_n(x) = 0$ for all $x$ but $x = 0$. Since the Lebesgue measure of a single point set is 0 (see Example 1.2), $\lim_{n\to\infty} f_n(x) = 0$ a.e. and $\int \lim_{n\to\infty} f_n(x)\, dx = 0$. On the other hand, $\int f_n(x)\, dx = 1$ for any $n$ and, hence, $\lim_{n\to\infty} \int f_n(x)\, dx = 1$. ∎
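The failure of the interchange in this example can be traced numerically: each integral equals 1, while the values at any fixed nonzero point are eventually 0. The sketch below uses exact rationals; `f`, `integral`, and `tail` are illustrative names, and the integral is computed from the closed form (height times width) rather than by quadrature.

```python
from fractions import Fraction

def f(n, x):
    """f_n(x) = n on [0, 1/n] and 0 elsewhere."""
    return n if 0 <= x <= Fraction(1, n) else 0

def integral(n):
    """Integral of f_n w.r.t. Lebesgue measure: height n times width 1/n."""
    return n * Fraction(1, n)

x = Fraction(1, 100)                        # any fixed x != 0
tail = [f(n, x) for n in range(101, 200)]   # f_n(x) = 0 once 1/n < x
```

At `x = 0` the sequence `f(n, 0)` equals `n` and diverges, but that single point has Lebesgue measure 0.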




The following result gives sufficient conditions under which (1.15) holds. 


Theorem 1.1. Let $f_1, f_2, \ldots$ be a sequence of Borel functions on $(\Omega,\mathcal{F},\nu)$.

(i) (Fatou's lemma). If $f_n \ge 0$, then

$$\int \liminf_n f_n\, d\nu \le \liminf_n \int f_n\, d\nu.$$

(ii) (Dominated convergence theorem). If $\lim_{n\to\infty} f_n = f$ a.e. and there exists an integrable function $g$ such that $|f_n| \le g$ a.e., then (1.15) holds.

(iii) (Monotone convergence theorem). If $0 \le f_1 \le f_2 \le \cdots$ and $\lim_{n\to\infty} f_n = f$ a.e., then (1.15) holds.

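Proof. The results in (i) and (iii) are equivalent (exercise). Applying Fatou's lemma to functions $g + f_n$ and $g - f_n$, we obtain that $\int (g+f)\, d\nu \le \liminf_n \int (g+f_n)\, d\nu$ and $\int (g-f)\, d\nu \le \liminf_n \int (g-f_n)\, d\nu$ (which is the same as $\int (f-g)\, d\nu \ge \limsup_n \int (f_n-g)\, d\nu$). Since $g$ is integrable, these results imply that $\int f\, d\nu \le \liminf_n \int f_n\, d\nu \le \limsup_n \int f_n\, d\nu \le \int f\, d\nu$. Hence, the result in (i) implies the result in (ii).

It remains to show part (iii). Let $f, f_1, f_2, \ldots$ be given in part (iii). From Proposition 1.6(i), there exists $\lim_{n\to\infty} \int f_n\, d\nu \le \int f\, d\nu$. Let $\varphi$ be a simple function with $0 \le \varphi \le f$ and let $A_\varphi = \{\varphi > 0\}$. Suppose that $\nu(A_\varphi) = \infty$. Then $\int f\, d\nu = \infty$. Let $a = 2^{-1}\min_{\omega \in A_\varphi} \varphi(\omega)$ and $A_n = \{f_n > a\}$. Then $a > 0$, $A_1 \subset A_2 \subset \cdots$, and $A_\varphi \subset \cup A_n$ (why?). By Proposition 1.1, $\nu(A_n) \to \nu(\cup A_n) \ge \nu(A_\varphi) = \infty$ and, hence, $\int f_n\, d\nu \ge \int_{A_n} f_n\, d\nu \ge a\nu(A_n) \to \infty$. Suppose now $\nu(A_\varphi) < \infty$. By Egoroff's theorem (Exercise 20 in §1.6), for any $\epsilon > 0$, there is $B \subset A_\varphi$ with $\nu(B) < \epsilon$ such that $f_n$ converges to $f$ uniformly on $A_\varphi \cap B^c$. Hence,

$$\int f_n\, d\nu \ge \int_{A_\varphi \cap B^c} f_n\, d\nu \to \int_{A_\varphi \cap B^c} f\, d\nu \ge \int_{A_\varphi \cap B^c} \varphi\, d\nu = \int \varphi\, d\nu - \int_B \varphi\, d\nu \ge \int \varphi\, d\nu - \epsilon \max_\omega \varphi(\omega).$$

Since $\epsilon$ is arbitrary, $\lim_{n\to\infty} \int f_n\, d\nu \ge \int \varphi\, d\nu$. Since $\varphi$ is arbitrary, by Definition 1.4(b), $\lim_{n\to\infty} \int f_n\, d\nu \ge \int f\, d\nu$. This completes the proof. ∎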


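Example 1.8 (Interchange of differentiation and integration). Let $(\Omega,\mathcal{F},\nu)$ be a measure space and, for any fixed $\theta \in \mathcal{R}$, let $f(\omega,\theta)$ be a Borel function on $\Omega$. Suppose that $\partial f(\omega,\theta)/\partial\theta$ exists a.e. for $\theta \in (a,b) \subset \mathcal{R}$ and that $|\partial f(\omega,\theta)/\partial\theta| \le g(\omega)$ a.e., where $g$ is an integrable function on $\Omega$. Then, for each $\theta \in (a,b)$, $\partial f(\omega,\theta)/\partial\theta$ is integrable and, by Theorem 1.1(ii),

$$\frac{d}{d\theta} \int f(\omega,\theta)\, d\nu = \int \frac{\partial f(\omega,\theta)}{\partial\theta}\, d\nu.$$ ∎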


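Theorem 1.2 (Change of variables). Let $f$ be measurable from $(\Omega,\mathcal{F},\nu)$ to $(\Lambda,\mathcal{G})$ and $g$ be Borel on $(\Lambda,\mathcal{G})$. Then

$$\int_\Omega g \circ f\, d\nu = \int_\Lambda g\, d(\nu \circ f^{-1}), \tag{1.16}$$

i.e., if either integral exists, then so does the other, and the two are the same. ∎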




The reader is encouraged to provide a proof. A complete proof is in Billingsley (1986, p. 219). This result extends the change of variable formula for Riemann integrals, i.e., $\int g(y)\, dy = \int g(f(x)) f'(x)\, dx$, $y = f(x)$.

Result (1.16) is very important in probability and statistics. Let $X$ be a random variable on a probability space $(\Omega,\mathcal{F},P)$. If $EX = \int_\Omega X\, dP$ exists, then usually it is much simpler to compute $EX = \int_{\mathcal{R}} x\, dP_X$, where $P_X = P \circ X^{-1}$ is the law of $X$. Let $Y$ be a random vector from $\Omega$ to $\mathcal{R}^k$ and $g$ be Borel from $\mathcal{R}^k$ to $\mathcal{R}$. According to (1.16), $Eg(Y)$ can be computed as $\int_{\mathcal{R}^k} g(y)\, dP_Y$ or $\int_{\mathcal{R}} x\, dP_{g(Y)}$, depending on which of $P_Y$ and $P_{g(Y)}$ is easier to handle. As a more specific example, consider $k = 2$, $Y = (X_1,X_2)$, and $g(Y) = X_1 + X_2$. Using Proposition 1.5(ii), $E(X_1 + X_2) = EX_1 + EX_2$ and, hence, $E(X_1 + X_2) = \int_{\mathcal{R}} x\, dP_{X_1} + \int_{\mathcal{R}} x\, dP_{X_2}$. Then we need to handle two integrals involving $P_{X_1}$ and $P_{X_2}$. On the other hand, $E(X_1 + X_2) = \int_{\mathcal{R}} x\, dP_{X_1+X_2}$, which involves one integral w.r.t. $P_{X_1+X_2}$. Unless we have some knowledge about the joint c.d.f. of $(X_1,X_2)$, it is not easy to obtain $P_{X_1+X_2}$.
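The three ways of computing $E(X_1 + X_2)$ discussed above can be compared on a finite example, where every integral reduces to a sum. The sketch below assumes a joint law on four points with illustrative probabilities; `P_Y`, `g`, `P_g`, and the `E_*` names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Joint law P_Y of Y = (X1, X2) on a finite set (illustrative values).
P_Y = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
       (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8)}

def g(y):
    return y[0] + y[1]

# E g(Y) as the integral of g w.r.t. P_Y.
E_via_joint = sum(g(y) * p for y, p in P_Y.items())

# E g(Y) as the integral of x w.r.t. the law of g(Y), i.e. P_Y o g^{-1}.
P_g = defaultdict(Fraction)
for y, p in P_Y.items():
    P_g[g(y)] += p
E_via_law = sum(x * p for x, p in P_g.items())

# Linearity: E X1 + E X2, each computed from the corresponding coordinate.
E_X1 = sum(y[0] * p for y, p in P_Y.items())
E_X2 = sum(y[1] * p for y, p in P_Y.items())
```

Building `P_g` requires the joint law, which reflects the remark that the law of the sum is not available from the marginals alone.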

The following theorem states how to evaluate an integral w.r.t. a product 
measure via iterated integration. The reader is encouraged to prove this 
theorem. A complete proof can be found in Billingsley (1986, pp. 236-238). 


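Theorem 1.3 (Fubini's theorem). Let $\nu_i$ be a $\sigma$-finite measure on $(\Omega_i,\mathcal{F}_i)$, $i = 1,2$, and let $f$ be a Borel function on $\prod_{i=1}^2 (\Omega_i,\mathcal{F}_i)$. Suppose that either $f \ge 0$ or $f$ is integrable w.r.t. $\nu_1 \times \nu_2$. Then

$$g(\omega_2) = \int_{\Omega_1} f(\omega_1,\omega_2)\, d\nu_1$$

exists a.e. $\nu_2$ and defines a Borel function on $\Omega_2$ whose integral w.r.t. $\nu_2$ exists, and

$$\int_{\Omega_1 \times \Omega_2} f(\omega_1,\omega_2)\, d\nu_1 \times \nu_2 = \int_{\Omega_2} \Big[ \int_{\Omega_1} f(\omega_1,\omega_2)\, d\nu_1 \Big] d\nu_2.$$ ∎

This result can be naturally extended to the integral w.r.t. the product measure on $\prod_{i=1}^k (\Omega_i,\mathcal{F}_i)$ for any finite positive integer $k$.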


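Example 1.9. Let $\Omega_1 = \Omega_2 = \{0,1,2,\ldots\}$, and $\nu_1 = \nu_2$ be the counting measure (Example 1.1). A function $f$ on $\Omega_1 \times \Omega_2$ defines a double sequence. If $f \ge 0$ or $\int |f|\, d\nu_1 \times \nu_2 < \infty$, then

$$\int f\, d\nu_1 \times \nu_2 = \sum_{i=0}^\infty \sum_{j=0}^\infty f(i,j) = \sum_{j=0}^\infty \sum_{i=0}^\infty f(i,j) \tag{1.17}$$

(by Theorem 1.3 and Example 1.5). Thus, a double series can be summed in either order, if it is summable or $f \ge 0$. ∎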




1.2.2 Radon-Nikodym derivative 


Let $(\Omega,\mathcal{F},\nu)$ be a measure space and $f$ be a nonnegative Borel function. One can show that the set function

$$\lambda(A) = \int_A f\, d\nu, \qquad A \in \mathcal{F}, \tag{1.18}$$

is a measure on $(\Omega,\mathcal{F})$ (verify). Note that

$$\nu(A) = 0 \quad \text{implies} \quad \lambda(A) = 0. \tag{1.19}$$

If (1.19) holds for two measures $\lambda$ and $\nu$ defined on the same measurable space, then we say $\lambda$ is absolutely continuous w.r.t. $\nu$ and write $\lambda \ll \nu$.


Formula (1.18) gives us not only a way of constructing measures, but also a method of computing measures of measurable sets. Let $\nu$ be a well-known measure (such as the Lebesgue measure or the counting measure) and $\lambda$ a relatively unknown measure. If we can find a function $f$ such that (1.18) holds, then computing $\lambda(A)$ can be done through integration. A necessary condition for (1.18) is clearly $\lambda \ll \nu$. The following result shows that $\lambda \ll \nu$ is also almost sufficient for (1.18).


Theorem 1.4 (Radon-Nikodym theorem). Let $\nu$ and $\lambda$ be two measures on $(\Omega,\mathcal{F})$ and $\nu$ be $\sigma$-finite. If $\lambda \ll \nu$, then there exists a nonnegative Borel function $f$ on $\Omega$ such that (1.18) holds. Furthermore, $f$ is unique a.e. $\nu$, i.e., if $\lambda(A) = \int_A g\, d\nu$ for any $A \in \mathcal{F}$, then $f = g$ a.e. $\nu$. ∎

The proof of this theorem can be found in Billingsley (1986, pp. 443-444). If (1.18) holds, then the function $f$ is called the Radon-Nikodym derivative or density of $\lambda$ w.r.t. $\nu$ and is denoted by $d\lambda/d\nu$.


A useful consequence of Theorem 1.4 is that if $f$ is Borel on $(\Omega,\mathcal{F})$ and $\int_A f\, d\nu = 0$ for any $A \in \mathcal{F}$, then $f = 0$ a.e.

If $\int f\, d\nu = 1$ for an $f \ge 0$ a.e. $\nu$, then $\lambda$ given by (1.18) is a probability measure and $f$ is called its probability density function (p.d.f.) w.r.t. $\nu$. For any probability measure $P$ on $(\mathcal{R}^k,\mathcal{B}^k)$ corresponding to a c.d.f. $F$ or a random vector $X$, if $P$ has a p.d.f. $f$ w.r.t. a measure $\nu$, then $f$ is also called the p.d.f. of $F$ or $X$ w.r.t. $\nu$.


Example 1.10 (p.d.f. of a discrete c.d.f.). Consider the discrete c.d.f. $F$ in (1.10) of Example 1.3 with its probability measure given by (1.11). Let $\Omega = \{a_1, a_2, \ldots\}$ and $\nu$ be the counting measure on the power set of $\Omega$. By Example 1.5,

$$P(A) = \int_A f\, d\nu = \sum_{a_i \in A} f(a_i), \qquad A \subset \Omega, \tag{1.20}$$

where $f(a_i) = p_i$, $i = 1,2,\ldots$. That is, $f$ is the p.d.f. of $P$ or $F$ w.r.t. $\nu$. Hence, any discrete c.d.f. has a p.d.f. w.r.t. counting measure. A p.d.f. w.r.t. counting measure is called a discrete p.d.f. ∎


Example 1.11. Let $F$ be a c.d.f. Assume that $F$ is differentiable in the usual sense in calculus. Let $f$ be the derivative of $F$. From calculus,

$$F(x) = \int_{-\infty}^x f(y)\, dy, \qquad x \in \mathcal{R}. \tag{1.21}$$

Let $P$ be the probability measure corresponding to $F$. It can be shown that $P(A) = \int_A f\, dm$ for any $A \in \mathcal{B}$, where $m$ is the Lebesgue measure on $\mathcal{R}$. Hence, $f$ is the p.d.f. of $P$ or $F$ w.r.t. Lebesgue measure. In this case, the Radon-Nikodym derivative is the same as the usual derivative of $F$ in calculus.


A continuous c.d.f. may not have a p.d.f. w.r.t. Lebesgue measure. A necessary and sufficient condition for a c.d.f. $F$ having a p.d.f. w.r.t. Lebesgue measure is that $F$ is absolutely continuous in the sense that for any $\epsilon > 0$, there exists a $\delta > 0$ such that for each finite collection of disjoint bounded open intervals $(a_i,b_i)$, $\sum (b_i - a_i) < \delta$ implies $\sum [F(b_i) - F(a_i)] < \epsilon$. Absolute continuity is weaker than differentiability, but is stronger than continuity. Thus, any discontinuous c.d.f. (such as a discrete c.d.f.) is not absolutely continuous. Note that every c.d.f. is differentiable a.e. Lebesgue measure (Chung, 1974, Chapter 1). Hence, if $f$ is the p.d.f. of $F$ w.r.t. Lebesgue measure, then $f$ is the usual derivative of $F$ a.e. Lebesgue measure and (1.21) holds. In such a case probabilities can be computed through integration. It can be shown that the uniform and exponential c.d.f.'s in Example 1.4 are absolutely continuous and their p.d.f.'s are, respectively,

$$f(x) = \begin{cases} \frac{1}{b-a} & a \le x < b \\ 0 & \text{otherwise} \end{cases}$$

and

$$f(x) = \begin{cases} 0 & -\infty < x < 0 \\ \theta^{-1} e^{-x/\theta} & 0 \le x < \infty. \end{cases}$$

A p.d.f. w.r.t. Lebesgue measure is called a Lebesgue p.d.f. ∎

More examples of p.d.f.'s are given in §1.3.1.
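The statement that probabilities can be computed through integration of a Lebesgue p.d.f. can be illustrated numerically for the exponential case. The sketch below assumes a particular value of $\theta$ and uses a midpoint rule as a numerical stand-in for the Lebesgue integral; `pdf`, `cdf`, `integrate`, and `prob` are illustrative names.

```python
import math

theta = 2.0  # illustrative parameter value

def pdf(x):
    """Exponential Lebesgue p.d.f.: theta^{-1} e^{-x/theta} for x >= 0, else 0."""
    return math.exp(-x / theta) / theta if x >= 0 else 0.0

def cdf(x):
    """Exponential c.d.f. from Example 1.4."""
    return 1.0 - math.exp(-x / theta) if x >= 0 else 0.0

def integrate(h, lo, hi, steps=100_000):
    """Midpoint rule on [lo, hi] (an approximation, not an exact integral)."""
    width = (hi - lo) / steps
    return sum(h(lo + (i + 0.5) * width) for i in range(steps)) * width

# P((1, 3]) computed by integrating the p.d.f.
prob = integrate(pdf, 1.0, 3.0)
```

The value of `prob` agrees with `cdf(3.0) - cdf(1.0)` up to the quadrature error, as (1.21) predicts.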


The following result provides some basic properties of Radon-Nikodym 
derivatives. The proof is left to the reader. 


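Proposition 1.7 (Calculus with Radon-Nikodym derivatives). Let $\nu$ be a $\sigma$-finite measure on a measurable space $(\Omega,\mathcal{F})$. All other measures discussed in (i)-(iii) are defined on $(\Omega,\mathcal{F})$.

(i) If $\lambda$ is a measure, $\lambda \ll \nu$, and $f \ge 0$, then

$$\int f\, d\lambda = \int f \frac{d\lambda}{d\nu}\, d\nu.$$

(Notice how the $d\nu$'s "cancel" on the right-hand side.)

(ii) If $\lambda_i$, $i = 1,2$, are measures and $\lambda_i \ll \nu$, then $\lambda_1 + \lambda_2 \ll \nu$ and

$$\frac{d(\lambda_1 + \lambda_2)}{d\nu} = \frac{d\lambda_1}{d\nu} + \frac{d\lambda_2}{d\nu} \quad \text{a.e. } \nu.$$

(iii) (Chain rule). If $\tau$ is a measure, $\lambda$ is a $\sigma$-finite measure, and $\tau \ll \lambda \ll \nu$, then

$$\frac{d\tau}{d\nu} = \frac{d\tau}{d\lambda} \frac{d\lambda}{d\nu} \quad \text{a.e. } \nu.$$

In particular, if $\lambda \ll \nu$ and $\nu \ll \lambda$ (in which case $\lambda$ and $\nu$ are equivalent), then

$$\frac{d\lambda}{d\nu} = \Big( \frac{d\nu}{d\lambda} \Big)^{-1} \quad \text{a.e. } \nu \text{ or } \lambda.$$

(iv) Let $(\Omega_i,\mathcal{F}_i,\nu_i)$ be a measure space and $\nu_i$ be $\sigma$-finite, $i = 1,2$. Let $\lambda_i$ be a $\sigma$-finite measure on $(\Omega_i,\mathcal{F}_i)$ and $\lambda_i \ll \nu_i$, $i = 1,2$. Then $\lambda_1 \times \lambda_2 \ll \nu_1 \times \nu_2$ and

$$\frac{d(\lambda_1 \times \lambda_2)}{d(\nu_1 \times \nu_2)}(\omega_1,\omega_2) = \frac{d\lambda_1}{d\nu_1}(\omega_1) \frac{d\lambda_2}{d\nu_2}(\omega_2) \quad \text{a.e. } \nu_1 \times \nu_2.$$ ∎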
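On a finite space where every point has positive mass, a Radon-Nikodym derivative is just the ratio of point masses, so the chain rule and the inverse relation of Proposition 1.7(iii) can be checked exactly. The sketch below works under that assumption; the measures `nu`, `lam`, `tau` and the helper names are illustrative.

```python
from fractions import Fraction

Omega = ["a", "b", "c"]

# Point masses; all are positive, so tau << lam << nu holds trivially here.
nu = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(4)}
lam = {"a": Fraction(3), "b": Fraction(1), "c": Fraction(2)}
tau = {"a": Fraction(1, 2), "b": Fraction(5), "c": Fraction(1)}

def derivative(top, bottom):
    """d(top)/d(bottom) on a finite space with positive point masses:
    the ratio of the masses at each point."""
    return {w: top[w] / bottom[w] for w in Omega}

def integral(h, measure, A):
    """Integral of h over A w.r.t. a measure given by point masses."""
    return sum((h[w] * measure[w] for w in A), Fraction(0))

dtau_dnu = derivative(tau, nu)

# Chain rule: (d tau / d lam) * (d lam / d nu), computed pointwise.
chain = {w: derivative(tau, lam)[w] * derivative(lam, nu)[w] for w in Omega}

# Inverse relation: d lam / d nu = (d nu / d lam)^{-1}.
inverse = {w: 1 / derivative(nu, lam)[w] for w in Omega}
```

Integrating `dtau_dnu` against `nu` over any subset recovers the `tau` measure of that subset, which is (1.18) in this setting.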


1.3. Distributions and Their Characteristics 


We now discuss some distributions useful in statistics, and their moments 
and generating functions. 


1.3.1 Distributions and probability densities 


It is often more convenient to work with p.d.f.’s than to work with c.d.f.’s. 
We now introduce some p.d.f.’s useful in statistics. 


We first consider p.d.f.'s on $\mathcal{R}$. Most discrete p.d.f.'s are w.r.t. counting measure on the space of all nonnegative integers. Table 1.1 lists all discrete p.d.f.'s in elementary probability textbooks. For any discrete p.d.f. $f$, its c.d.f. $F(x)$ can be obtained using (1.20) with $A = (-\infty,x]$. Values of $F(x)$ can be obtained from statistical tables or software.


Two Lebesgue p.d.f.'s are introduced in Example 1.11. Some other useful Lebesgue p.d.f.'s are listed in Table 1.2. Note that the exponential p.d.f. in Example 1.11 is a special case of that in Table 1.2 with $a = 0$. For any Lebesgue p.d.f. $f$, (1.21) gives its c.d.f. A few c.d.f.'s have explicit forms, whereas many others do not and they have to be evaluated numerically or computed using tables or software.


Table 1.1. Discrete Distributions on $\mathcal{R}$

Uniform $DU(a_1,\ldots,a_m)$
    p.d.f.        $1/m$, $x = a_1,\ldots,a_m$
    m.g.f.        $\sum_{j=1}^m e^{a_j t}/m$, $t \in \mathcal{R}$
    Expectation   $\sum_{j=1}^m a_j/m$
    Variance      $\sum_{j=1}^m (a_j - \bar{a})^2/m$, $\bar{a} = \sum_{j=1}^m a_j/m$
    Parameter     $a_i \in \mathcal{R}$, $m = 1,2,\ldots$

Binomial $Bi(p,n)$
    p.d.f.        $\binom{n}{x} p^x (1-p)^{n-x}$, $x = 0,1,\ldots,n$
    m.g.f.        $(pe^t + 1 - p)^n$, $t \in \mathcal{R}$
    Expectation   $np$
    Variance      $np(1-p)$
    Parameter     $p \in [0,1]$, $n = 1,2,\ldots$

Poisson $P(\theta)$
    p.d.f.        $\theta^x e^{-\theta}/x!$, $x = 0,1,2,\ldots$
    m.g.f.        $e^{\theta(e^t - 1)}$, $t \in \mathcal{R}$
    Expectation   $\theta$
    Variance      $\theta$
    Parameter     $\theta > 0$

Geometric $G(p)$
    p.d.f.        $(1-p)^{x-1} p$, $x = 1,2,\ldots$
    m.g.f.        $pe^t/[1 - (1-p)e^t]$, $t < -\log(1-p)$
    Expectation   $1/p$
    Variance      $(1-p)/p^2$
    Parameter     $p \in (0,1)$

Hypergeometric $HG(r,n,m)$
    p.d.f.        $\binom{n}{x}\binom{m}{r-x} \big/ \binom{N}{r}$, $x = 0,1,\ldots,\min\{r,n\}$, $r - x \le m$
    m.g.f.        No explicit form
    Expectation   $rn/N$
    Variance      $rnm(N-r)/[N^2(N-1)]$
    Parameter     $r, n, m = 1,2,\ldots$, $N = n + m$

Negative binomial $NB(p,r)$
    p.d.f.        $\binom{x-1}{r-1} p^r (1-p)^{x-r}$, $x = r, r+1, \ldots$
    m.g.f.        $p^r e^{rt}/[1 - (1-p)e^t]^r$, $t < -\log(1-p)$
    Expectation   $r/p$
    Variance      $r(1-p)/p^2$
    Parameter     $p \in [0,1]$, $r = 1,2,\ldots$

Log-distribution $L(p)$
    p.d.f.        $-(\log p)^{-1} x^{-1} (1-p)^x$, $x = 1,2,\ldots$
    m.g.f.        $\log[1 - (1-p)e^t]/\log p$, $t \in \mathcal{R}$
    Expectation   $-(1-p)/(p \log p)$
    Variance      $-(1-p)[1 + (1-p)/\log p]/(p^2 \log p)$
    Parameter     $p \in (0,1)$

All p.d.f.'s are w.r.t. counting measure.


There are p.d.f.’s that are neither discrete nor Lebesgue. 


Example 1.12. Let $X$ be a random variable on $(\Omega,\mathcal{F},P)$ whose c.d.f. $F_X$ has a Lebesgue p.d.f. $f_X$ and $F_X(c) < 1$, where $c$ is a fixed constant. Let $Y = \min\{X,c\}$, i.e., $Y$ is the smaller of $X$ and $c$. Note that $Y^{-1}((-\infty,x]) = \Omega$ if $x \ge c$ and $Y^{-1}((-\infty,x]) = X^{-1}((-\infty,x])$ if $x < c$. Hence $Y$ is a random variable and the c.d.f. of $Y$ is

$$F_Y(x) = \begin{cases} 1 & x \ge c \\ F_X(x) & x < c. \end{cases}$$

This c.d.f. is discontinuous at $c$, since $F_X(c) < 1$. Thus, it does not have a Lebesgue p.d.f. It is not discrete either. Does $P_Y$, the probability measure corresponding to $F_Y$, have a p.d.f. w.r.t. some measure? Define a probability measure on $(\mathcal{R},\mathcal{B})$, called point mass at $c$, by

$$\delta_c(A) = \begin{cases} 1 & c \in A \\ 0 & c \notin A, \end{cases} \qquad A \in \mathcal{B} \tag{1.22}$$

(which is a special case of the discrete uniform distribution in Table 1.1). Then $P_Y \ll m + \delta_c$, where $m$ is the Lebesgue measure, and the p.d.f. of $P_Y$ is

$$\frac{dP_Y}{d(m + \delta_c)}(x) = \begin{cases} 0 & x > c \\ 1 - F_X(c) & x = c \\ f_X(x) & x < c. \end{cases} \tag{1.23}$$ ∎
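The density (1.23) can be sanity-checked numerically: integrating it w.r.t. $m + \delta_c$ should give total mass 1, and the jump of $F_Y$ at $c$ should equal $1 - F_X(c)$. The sketch below assumes $X$ is exponential with illustrative values of $\theta$ and $c$, and uses a midpoint rule for the Lebesgue part; all names are hypothetical.

```python
import math

theta, c = 1.0, 2.0  # illustrative values; X is exponential with scale theta

def F_X(x):
    return 1.0 - math.exp(-x / theta) if x >= 0 else 0.0

def f_X(x):
    return math.exp(-x / theta) / theta if x >= 0 else 0.0

def F_Y(x):
    """c.d.f. of Y = min{X, c}."""
    return 1.0 if x >= c else F_X(x)

def density_Y(x):
    """p.d.f. of P_Y w.r.t. m + delta_c, as in (1.23)."""
    if x > c:
        return 0.0
    if x == c:
        return 1.0 - F_X(c)
    return f_X(x)

def integrate(h, lo, hi, steps=100_000):
    """Midpoint rule for the Lebesgue part of the integral (approximate)."""
    width = (hi - lo) / steps
    return sum(h(lo + (i + 0.5) * width) for i in range(steps)) * width

# Integral w.r.t. m + delta_c: the Lebesgue part on (0, c), where the density
# is positive, plus the density at c times delta_c({c}) = 1.
total_mass = integrate(density_Y, 0.0, c) + density_Y(c)
jump_at_c = F_Y(c) - F_X(c)
```

The single point `c` contributes nothing to the Lebesgue part but carries mass `1 - F_X(c)` through the point mass.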


A p.d.f. corresponding to a joint c.d.f. is called a joint p.d.f. The following is a joint Lebesgue p.d.f. on $\mathcal{R}^k$ that is important in statistics:

$$f(x) = (2\pi)^{-k/2} [\mathrm{Det}(\Sigma)]^{-1/2} e^{-(x-\mu)^\tau \Sigma^{-1} (x-\mu)/2}, \qquad x \in \mathcal{R}^k, \tag{1.24}$$

where $\mu \in \mathcal{R}^k$, $\Sigma$ is a positive definite $k \times k$ matrix, $\mathrm{Det}(\Sigma)$ is the determinant of $\Sigma$ and, when matrix algebra is involved, any $k$-vector $c$ is treated as a $k \times 1$ matrix (column vector) and $c^\tau$ denotes its transpose (row vector). The p.d.f. in (1.24) and its c.d.f. are called the $k$-dimensional multivariate normal p.d.f. and c.d.f., and both are denoted by $N_k(\mu,\Sigma)$. Random vectors distributed as $N_k(\mu,\Sigma)$ are also denoted by $N_k(\mu,\Sigma)$ for convenience. The normal distribution $N(\mu,\sigma^2)$ in Table 1.2 is a special case of $N_k(\mu,\Sigma)$ with $k = 1$. In particular, $N(0,1)$ is called the standard normal distribution. When $\Sigma$ is a nonnegative definite but singular matrix, we define $X$ to be $N_k(\mu,\Sigma)$ if and only if $c^\tau X$ is $N(c^\tau\mu, c^\tau\Sigma c)$ for any $c \in \mathcal{R}^k$ ($N(a,0)$ is defined to be the c.d.f. of the point mass at $a$), which is an important property of $N_k(\mu,\Sigma)$ with a nonsingular $\Sigma$ (Exercise 81).


Another important joint p.d.f. will be introduced in Example 2.7. 


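Table 1.2. Distributions on $\mathcal{R}$ with Lebesgue p.d.f.'s

Uniform $U(a,b)$
    p.d.f.        $(b-a)^{-1} I_{(a,b)}(x)$
    m.g.f.        $(e^{bt} - e^{at})/[(b-a)t]$, $t \in \mathcal{R}$
    Expectation   $(a+b)/2$
    Variance      $(b-a)^2/12$
    Parameter     $a, b \in \mathcal{R}$, $a < b$

Normal $N(\mu,\sigma^2)$
    p.d.f.        $\frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/2\sigma^2}$
    m.g.f.        $e^{\mu t + \sigma^2 t^2/2}$, $t \in \mathcal{R}$
    Expectation   $\mu$
    Variance      $\sigma^2$
    Parameter     $\mu \in \mathcal{R}$, $\sigma > 0$

Exponential $E(a,\theta)$
    p.d.f.        $\theta^{-1} e^{-(x-a)/\theta} I_{(a,\infty)}(x)$
    m.g.f.        $e^{at}(1 - \theta t)^{-1}$, $t < \theta^{-1}$
    Expectation   $\theta + a$
    Variance      $\theta^2$
    Parameter     $\theta > 0$, $a \in \mathcal{R}$

Chi-square $\chi^2_k$
    p.d.f.        $\frac{1}{\Gamma(k/2) 2^{k/2}} x^{k/2-1} e^{-x/2} I_{(0,\infty)}(x)$
    m.g.f.        $(1 - 2t)^{-k/2}$, $t < 1/2$
    Expectation   $k$
    Variance      $2k$
    Parameter     $k = 1,2,\ldots$

Gamma $\Gamma(\alpha,\gamma)$
    p.d.f.        $\frac{1}{\Gamma(\alpha)\gamma^\alpha} x^{\alpha-1} e^{-x/\gamma} I_{(0,\infty)}(x)$
    m.g.f.        $(1 - \gamma t)^{-\alpha}$, $t < \gamma^{-1}$
    Expectation   $\alpha\gamma$
    Variance      $\alpha\gamma^2$
    Parameter     $\gamma > 0$, $\alpha > 0$

Beta $B(\alpha,\beta)$
    p.d.f.        $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} I_{(0,1)}(x)$
    m.g.f.        No explicit form
    Expectation   $\alpha/(\alpha+\beta)$
    Variance      $\alpha\beta/[(\alpha+\beta+1)(\alpha+\beta)^2]$
    Parameter     $\alpha > 0$, $\beta > 0$

Cauchy $C(\mu,\sigma)$
    p.d.f.        $\frac{1}{\pi\sigma} \big[ 1 + \big( \frac{x-\mu}{\sigma} \big)^2 \big]^{-1}$
    ch.f.         $e^{\sqrt{-1}\mu t - \sigma|t|}$
    Expectation   Does not exist
    Variance      Does not exist
    Parameter     $\mu \in \mathcal{R}$, $\sigma > 0$

t-distribution $t_n$
    p.d.f.        $\frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\,\Gamma(n/2)} \big( 1 + \frac{x^2}{n} \big)^{-(n+1)/2}$
    ch.f.         No explicit form
    Expectation   $0$, $(n > 1)$
    Variance      $n/(n-2)$, $(n > 2)$
    Parameter     $n = 1,2,\ldots$

F-distribution $F_{n,m}$
    p.d.f.        $\frac{n^{n/2} m^{m/2} \Gamma[(n+m)/2]\, x^{n/2-1}}{\Gamma(n/2)\Gamma(m/2)(m+nx)^{(n+m)/2}} I_{(0,\infty)}(x)$
    ch.f.         No explicit form
    Expectation   $m/(m-2)$, $(m > 2)$
    Variance      $2m^2(n+m-2)/[n(m-2)^2(m-4)]$, $(m > 4)$
    Parameter     $n = 1,2,\ldots$, $m = 1,2,\ldots$

Log-normal $LN(\mu,\sigma^2)$
    p.d.f.        $\frac{1}{\sqrt{2\pi}\sigma} x^{-1} e^{-(\log x - \mu)^2/2\sigma^2} I_{(0,\infty)}(x)$
    ch.f.         No explicit form
    Expectation   $e^{\mu + \sigma^2/2}$
    Variance      $e^{2\mu + \sigma^2}(e^{\sigma^2} - 1)$
    Parameter     $\mu \in \mathcal{R}$, $\sigma > 0$

Weibull $W(\alpha,\theta)$
    p.d.f.        $\frac{\alpha}{\theta} x^{\alpha-1} e^{-x^\alpha/\theta} I_{(0,\infty)}(x)$
    ch.f.         No explicit form
    Expectation   $\theta^{1/\alpha} \Gamma(\alpha^{-1} + 1)$
    Variance      $\theta^{2/\alpha} \{ \Gamma(2\alpha^{-1} + 1) - [\Gamma(\alpha^{-1} + 1)]^2 \}$
    Parameter     $\theta > 0$, $\alpha > 0$

Double exponential $DE(\mu,\theta)$
    p.d.f.        $\frac{1}{2\theta} e^{-|x-\mu|/\theta}$
    m.g.f.        $e^{\mu t}/(1 - \theta^2 t^2)$, $|t| < \theta^{-1}$
    Expectation   $\mu$
    Variance      $2\theta^2$
    Parameter     $\mu \in \mathcal{R}$, $\theta > 0$

Pareto $Pa(a,\theta)$
    p.d.f.        $\theta a^\theta x^{-(\theta+1)} I_{(a,\infty)}(x)$
    ch.f.         No explicit form
    Expectation   $\theta a/(\theta - 1)$, $(\theta > 1)$
    Variance      $\theta a^2/[(\theta-1)^2(\theta-2)]$, $(\theta > 2)$
    Parameter     $\theta > 0$, $a > 0$

Logistic $LG(\mu,\sigma)$
    p.d.f.        $\sigma^{-1} e^{-(x-\mu)/\sigma} / [1 + e^{-(x-\mu)/\sigma}]^2$
    m.g.f.        $e^{\mu t} \Gamma(1 + \sigma t) \Gamma(1 - \sigma t)$, $|t| < \sigma^{-1}$
    Expectation   $\mu$
    Variance      $\sigma^2\pi^2/3$
    Parameter     $\mu \in \mathcal{R}$, $\sigma > 0$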




If a random $k$-vector $(X_1,\ldots,X_k)$ has a joint p.d.f. $f$ w.r.t. a product measure $\nu_1 \times \cdots \times \nu_k$ defined on $\mathcal{B}^k$, then $X_i$ has the following marginal p.d.f. w.r.t. $\nu_i$:

$$f_i(x) = \int_{\mathcal{R}^{k-1}} f(x_1,\ldots,x_{i-1},x,x_{i+1},\ldots,x_k)\, d\nu_1 \cdots d\nu_{i-1}\, d\nu_{i+1} \cdots d\nu_k.$$

Let $F$ be the joint c.d.f. of a random $k$-vector $(X_1,\ldots,X_k)$ and $F_i$ be the marginal c.d.f. of $X_i$, $i = 1,\ldots,k$. If (1.7) holds, then random variables $X_1,\ldots,X_k$ are said to be independent. From the discussion at the end of §1.1.1, this independence means that the probability measure corresponding to $F$ is the product measure of the $k$ probability measures corresponding to the $F_i$'s. The meaning of independence is further discussed in §1.4.2. If $(X_1,\ldots,X_k)$ has a joint p.d.f. $f$ w.r.t. a product measure $\nu_1 \times \cdots \times \nu_k$ defined on $\mathcal{B}^k$, then $X_1,\ldots,X_k$ are independent if and only if

$$f(x_1,\ldots,x_k) = f_1(x_1) \cdots f_k(x_k), \qquad (x_1,\ldots,x_k) \in \mathcal{R}^k, \tag{1.25}$$

where $f_i$ is the p.d.f. of $X_i$ w.r.t. $\nu_i$, $i = 1,\ldots,k$. For example, using (1.24), one can show (exercise) that the components of $N_k(\mu,\Sigma)$ are independent if and only if $\Sigma$ is a diagonal matrix.
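The factorization (1.25) can be checked numerically for the bivariate normal p.d.f. The sketch below writes out (1.24) for $k = 2$ with the explicit $2 \times 2$ inverse and compares the joint p.d.f. with the product of the two univariate normal p.d.f.'s at one point, for a diagonal and a non-diagonal covariance matrix with the same diagonal entries. All names and numerical values are illustrative, and a check at a single point is only an illustration, not a proof.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate N(mu, sigma^2) Lebesgue p.d.f."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def bivariate_normal_pdf(x, mu, Sigma):
    """(1.24) with k = 2, written out for a 2 x 2 positive definite Sigma."""
    (s11, s12), (s21, s22) = Sigma
    det = s11 * s22 - s12 * s21
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu) via the explicit 2 x 2 inverse.
    quad = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))

mu = (1.0, -2.0)
x = (0.3, -1.1)
diagonal = ((2.0, 0.0), (0.0, 0.5))
correlated = ((2.0, 0.6), (0.6, 0.5))

joint_diag = bivariate_normal_pdf(x, mu, diagonal)
joint_corr = bivariate_normal_pdf(x, mu, correlated)

# Both covariance matrices have the marginals N(1, 2) and N(-2, 0.5).
product_of_marginals = normal_pdf(x[0], mu[0], 2.0) * normal_pdf(x[1], mu[1], 0.5)
```

With the diagonal matrix the joint p.d.f. equals the product of the marginals; with the non-diagonal matrix it does not, even though the marginals are unchanged.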


The following lemma is useful in considering the independence of func- 
tions of independent random variables. 


Lemma 1.1. Let $X_1,\ldots,X_n$ be independent random variables. Then random variables $g(X_1,\ldots,X_k)$ and $h(X_{k+1},\ldots,X_n)$ are independent, where $g$ and $h$ are Borel functions and $k$ is an integer between 1 and $n$. ∎


Lemma 1.1 can be proved directly (exercise). But it is a simple conse- 
quence of an equivalent definition of independence introduced in §1.4.2. 


Let $X_1,\dots,X_k$ be random variables. If $X_i$ and $X_j$ are independent for
every pair $i \neq j$, then $X_1,\dots,X_k$ are said to be pairwise independent. If
$X_1,\dots,X_k$ are independent, then clearly they are pairwise independent.
However, the converse is not true. The following is an example.


Example 1.13. Let $X_1$ and $X_2$ be independent random variables each
assuming the values 1 and $-1$ with probability 0.5, and $X_3 = X_1X_2$. Let
$A_i = \{X_i = 1\}$, $i = 1,2,3$. Then $P(A_i) = 0.5$ for any $i$ and
$P(A_1)P(A_2)P(A_3) = 0.125$. However, $P(A_1\cap A_2\cap A_3) = P(A_1\cap A_2) = P(A_1)P(A_2) = 0.25$.
This implies that (1.7) does not hold and, hence, $X_1, X_2, X_3$ are not
independent. We now show that $X_1, X_2, X_3$ are pairwise independent. It is enough
to show that $X_1$ and $X_3$ are independent. Let $B_i = \{X_i = -1\}$, $i=1,2,3$.
Note that $A_1\cap A_3 = A_1\cap A_2$, $A_1\cap B_3 = A_1\cap B_2$, $B_1\cap A_3 = B_1\cap B_2$,
and $B_1\cap B_3 = B_1\cap A_2$. Then the result follows from the fact that
$P(A_i) = P(B_i) = 0.5$ for any $i$ and $X_1$ and $X_2$ are independent.
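
The computation in Example 1.13 can be checked by brute-force enumeration. The following is a minimal sketch in Python using exact rational arithmetic; the helper names `p` and `pair_independent` are illustrative assumptions and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: (x1, x2) uniform on {-1, 1}^2, with x3 = x1 * x2.
outcomes = [(x1, x2, x1 * x2) for x1, x2 in product((1, -1), repeat=2)]
prob = Fraction(1, len(outcomes))  # each outcome has probability 1/4


def p(event):
    """Probability of the set of outcomes satisfying `event`."""
    return sum(prob for w in outcomes if event(w))


def pair_independent(i, j):
    """Check P(Xi=a, Xj=b) = P(Xi=a) P(Xj=b) for all a, b in {-1, 1}."""
    return all(
        p(lambda w: w[i] == a and w[j] == b)
        == p(lambda w: w[i] == a) * p(lambda w: w[j] == b)
        for a, b in product((1, -1), repeat=2)
    )


# Pairwise independence holds for every pair of components ...
pairwise = all(pair_independent(i, j) for i, j in [(0, 1), (0, 2), (1, 2)])
# ... but the triple product rule fails.
p_all_one = p(lambda w: w == (1, 1, 1))   # P(A1 and A2 and A3)
p_product = p(lambda w: w[0] == 1) ** 3   # P(A1) P(A2) P(A3)
print(pairwise, p_all_one, p_product)
```

Under these assumptions the enumeration reports pairwise independence together with the two unequal probabilities 0.25 and 0.125 computed in the example.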




The random variable Y in Example 1.12 is a transformation of the 
random variable X. Transformations of random variables or vectors are 
frequently used in statistics. For a random variable or vector X, g(X) is 
a random variable or vector as long as g is measurable (Proposition 1.4). 
How do we find the c.d.f. (or p.d.f.) of g(X) when the c.d.f. (or p.d.f.) of X 
is known? In many cases, the most effective method is direct computation. 
Example 1.12 is one example. The following is another one. 


Example 1.14. Let $X$ be a random variable with c.d.f. $F_X$ and Lebesgue
p.d.f. $f_X$, and let $Y = X^2$. Since $Y^{-1}((-\infty,x])$ is empty if $x < 0$ and
equals $Y^{-1}([0,x]) = X^{-1}([-\sqrt{x},\sqrt{x}])$ if $x \ge 0$, the c.d.f. of $Y$ is

$$ F_Y(x) = P\circ Y^{-1}((-\infty,x]) = P\circ X^{-1}([-\sqrt{x},\sqrt{x}]) = F_X(\sqrt{x}) - F_X(-\sqrt{x}) $$

if $x \ge 0$ and $F_Y(x) = 0$ if $x < 0$. Clearly, the Lebesgue p.d.f. of $F_Y$ is

$$ f_Y(x) = \frac{1}{2\sqrt{x}}\,[f_X(\sqrt{x}) + f_X(-\sqrt{x})]\, I_{(0,\infty)}(x). \tag{1.26} $$

In particular, if

$$ f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \tag{1.27} $$

which is the Lebesgue p.d.f. of the standard normal distribution $N(0,1)$
(Table 1.2), then

$$ f_Y(x) = \frac{1}{\sqrt{2\pi x}}\, e^{-x/2}\, I_{(0,\infty)}(x), $$

which is the Lebesgue p.d.f. for the chi-square distribution $\chi^2_1$ (Table 1.2).
This is actually an important result in statistics.
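
As a numerical sanity check of Example 1.14, one can differentiate the c.d.f. of $Y = X^2$ and compare it with the density obtained from (1.26). The sketch below is only an illustration; it assumes $X$ is standard normal, uses the identity $F_X(s) - F_X(-s) = \mathrm{erf}(s/\sqrt{2})$, and the function names are hypothetical.

```python
import math


def F_Y(x):
    """c.d.f. of Y = X^2 for X ~ N(0,1): F_X(sqrt x) - F_X(-sqrt x)."""
    if x < 0:
        return 0.0
    # For the standard normal, Phi(s) - Phi(-s) = erf(s / sqrt(2)).
    return math.erf(math.sqrt(x / 2.0))


def f_Y(x):
    """Density from (1.26) with the standard normal f_X."""
    if x <= 0:
        return 0.0
    return math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi * x)


def numeric_derivative(x, h=1e-5):
    """Central-difference approximation of the derivative of F_Y at x > h."""
    return (F_Y(x + h) - F_Y(x - h)) / (2.0 * h)


for x in (0.5, 1.0, 2.0):
    print(x, numeric_derivative(x), f_Y(x))
```

The two printed columns should agree to several decimal places at each positive $x$.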


In some cases, one may apply the following general result whose proof 
is left to the reader. 


Proposition 1.8. Let $X$ be a random $k$-vector with a Lebesgue p.d.f. $f_X$
and let $Y = g(X)$, where $g$ is a Borel function from $(\mathcal{R}^k,\mathcal{B}^k)$ to $(\mathcal{R}^k,\mathcal{B}^k)$.
Let $A_1,\dots,A_m$ be disjoint sets in $\mathcal{B}^k$ such that $\mathcal{R}^k - (A_1\cup\cdots\cup A_m)$ has
Lebesgue measure 0 and $g$ on $A_j$ is one-to-one with a nonvanishing Jacobian,
i.e., the determinant $\mathrm{Det}(\partial g(x)/\partial x) \neq 0$ on $A_j$, $j=1,\dots,m$. Then $Y$
has the following Lebesgue p.d.f.:

$$ f_Y(x) = \sum_{j=1}^m \left|\mathrm{Det}\left(\partial h_j(x)/\partial x\right)\right| f_X(h_j(x)), $$

where $h_j$ is the inverse function of $g$ on $A_j$, $j=1,\dots,m$.




One may apply Proposition 1.8 to obtain result (1.26) in Example 1.14,
using $A_1 = (-\infty,0)$, $A_2 = (0,\infty)$, and $g(x) = x^2$. Note that $h_1(x) = -\sqrt{x}$,
$h_2(x) = \sqrt{x}$, and $|dh_j(x)/dx| = 1/(2\sqrt{x})$. Another immediate application
of Proposition 1.8 is to show that $Y = AX$ is $N_k(A\mu, A\Sigma A^\tau)$ when $X$ is
$N_k(\mu,\Sigma)$, where $\Sigma$ is positive definite, $A$ is a $k\times k$ matrix of rank $k$, and
$A^\tau$ denotes the transpose of $A$.


Example 1.15. Let $X = (X_1,X_2)$ be a random 2-vector having a joint
Lebesgue p.d.f. $f_X$. Consider first the transformation $g(x) = (x_1, x_1+x_2)$.
Using Proposition 1.8, one can show that the joint p.d.f. of $g(X)$ is

$$ f_{g(X)}(x_1,y) = f_X(x_1, y-x_1), $$

where $y = x_1+x_2$ (note that the Jacobian equals 1). The marginal p.d.f.
of $Y = X_1+X_2$ is then

$$ f_Y(y) = \int f_X(x_1, y-x_1)\,dx_1. $$

In particular, if $X_1$ and $X_2$ are independent, then

$$ f_Y(y) = \int f_{X_1}(x_1) f_{X_2}(y-x_1)\,dx_1. \tag{1.28} $$

Next, consider the transformation $h(x_1,x_2) = (x_1/x_2, x_2)$, assuming that
$X_2 \neq 0$ a.s. Using Proposition 1.8, one can show that the joint p.d.f. of
$h(X)$ is

$$ f_{h(X)}(z,x_2) = |x_2|\, f_X(zx_2, x_2), $$

where $z = x_1/x_2$. The marginal p.d.f. of $Z = X_1/X_2$ is

$$ f_Z(z) = \int |x_2|\, f_X(zx_2,x_2)\,dx_2. $$

In particular, if $X_1$ and $X_2$ are independent, then

$$ f_Z(z) = \int |x_2|\, f_{X_1}(zx_2) f_{X_2}(x_2)\,dx_2. \tag{1.29} $$

A number of results can be derived from (1.28) and (1.29). For example,
if $X_1$ and $X_2$ are independent and both have the standard normal p.d.f.
given by (1.27), then, by (1.29), the Lebesgue p.d.f. of $Z = X_1/X_2$ is

$$ f_Z(z) = \frac{1}{2\pi}\int |x_2|\, e^{-(1+z^2)x_2^2/2}\,dx_2 = \frac{1}{\pi}\int_0^\infty e^{-(1+z^2)x}\,dx = \frac{1}{\pi(1+z^2)}, $$




which is the p.d.f. of the Cauchy distribution C(0, 1) in Table 1.2. Another 
application of formula (1.29) leads to the following important result in 
statistics. 
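
Before moving on, the Cauchy density obtained from (1.29) can be checked numerically. The sketch below approximates the integral in (1.29) by a midpoint rule for two independent standard normal variables; the truncation width, the step count, and the function names are assumptions chosen for illustration only.

```python
import math


def normal_pdf(x):
    """Standard normal p.d.f. (1.27)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def ratio_pdf(z, half_width=12.0, steps=24000):
    """Midpoint-rule approximation of (1.29) for independent N(0,1) inputs."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x2 = -half_width + (i + 0.5) * h
        total += abs(x2) * normal_pdf(z * x2) * normal_pdf(x2)
    return total * h


def cauchy_pdf(z):
    """p.d.f. of the Cauchy distribution C(0, 1)."""
    return 1.0 / (math.pi * (1.0 + z * z))


for z in (0.0, 1.0, -2.5):
    print(z, ratio_pdf(z), cauchy_pdf(z))
```

The quadrature values should match $1/[\pi(1+z^2)]$ up to the discretization error of the rule.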


Example 1.16 (t-distribution and F-distribution). Let $X_1$ and $X_2$ be
independent random variables having the chi-square distributions $\chi^2_{n_1}$ and
$\chi^2_{n_2}$ (Table 1.2), respectively. By (1.29), the p.d.f. of $Z = X_1/X_2$ is

$$ f_Z(z) = \frac{z^{n_1/2-1} I_{(0,\infty)}(z)}{2^{(n_1+n_2)/2}\,\Gamma(n_1/2)\Gamma(n_2/2)} \int_0^\infty x_2^{(n_1+n_2)/2-1} e^{-(1+z)x_2/2}\,dx_2 $$
$$ \phantom{f_Z(z)} = \frac{\Gamma[(n_1+n_2)/2]}{\Gamma(n_1/2)\Gamma(n_2/2)}\, \frac{z^{n_1/2-1}}{(1+z)^{(n_1+n_2)/2}}\, I_{(0,\infty)}(z), $$

where the last equality follows from the fact that

$$ \frac{1}{2^{(n_1+n_2)/2}\,\Gamma[(n_1+n_2)/2]}\, x_2^{(n_1+n_2)/2-1} e^{-x_2/2}\, I_{(0,\infty)}(x_2) $$

is the p.d.f. of the chi-square distribution $\chi^2_{n_1+n_2}$. Using Proposition 1.8,
one can show that the p.d.f. of $Y = (X_1/n_1)/(X_2/n_2) = (n_2/n_1)Z$ is the
p.d.f. of the F-distribution $F_{n_1,n_2}$ given in Table 1.2.

Let $U_1$ be a random variable having the standard normal distribution
$N(0,1)$ and $U_2$ a random variable having the chi-square distribution $\chi^2_n$.
Using the same argument, one can show that if $U_1$ and $U_2$ are independent,
then the distribution of $T = U_1/\sqrt{U_2/n}$ is the t-distribution $t_n$ given in
Table 1.2. This result can also be derived using the result given in this
example as follows. Let $X_1 = U_1^2$ and $X_2 = U_2$. Then $X_1$ and $X_2$ are
independent (which can be shown directly but follows from Lemma 1.1).
By Example 1.14, the distribution of $X_1$ is $\chi^2_1$. Then $Y = X_1/(X_2/n)$ has
the F-distribution $F_{1,n}$ and its Lebesgue p.d.f. is

$$ \frac{n^{n/2}\,\Gamma[(n+1)/2]\, x^{-1/2}}{\sqrt{\pi}\,\Gamma(n/2)\,(n+x)^{(n+1)/2}}\, I_{(0,\infty)}(x). $$

Note that

$$ T = \begin{cases} \sqrt{Y} & U_1 \ge 0 \\ -\sqrt{Y} & U_1 < 0. \end{cases} $$

The result follows from Proposition 1.8 and the fact that

$$ P\circ T^{-1}((-\infty,-t]) = P\circ T^{-1}([t,\infty)), \qquad t > 0. \tag{1.30} $$


If a random variable $T$ satisfies (1.30), i.e., $T$ and $-T$ have the same
distribution, then $T$ and its c.d.f. and p.d.f. (if it exists) are said to be
symmetric about 0. If $T$ has a Lebesgue p.d.f. $f_T$, then $T$ is symmetric
about 0 if and only if $f_T(x) = f_T(-x)$ for any $x > 0$. $T$ and its c.d.f.
and p.d.f. are said to be symmetric about $a$ (or symmetric for simplicity)
if and only if $T - a$ is symmetric about 0 for a fixed $a \in \mathcal{R}$. The c.d.f.'s of
t-distributions are symmetric about 0 and the normal, Cauchy, and double
exponential c.d.f.'s are symmetric.


The chi-square, t-, and F-distributions in the previous examples are 
special cases of the following noncentral chi-square, t-, and F-distributions, 
which are useful in some statistical problems. 


Let $X_1,\dots,X_n$ be independent random variables and $X_i = N(\mu_i,\sigma^2)$,
$i=1,\dots,n$. The distribution of the random variable $Y = (X_1^2+\cdots+X_n^2)/\sigma^2$
is called the noncentral chi-square distribution and denoted by $\chi^2_n(\delta)$, where
$\delta = (\mu_1^2+\cdots+\mu_n^2)/\sigma^2$ is the noncentrality parameter. The chi-square
distribution $\chi^2_k$ in Table 1.2 is a special case of the noncentral chi-square
distribution $\chi^2_k(\delta)$ with $\delta = 0$ and, therefore, is called a central chi-square
distribution. It can be shown (exercise) that $Y$ has the following Lebesgue
p.d.f.:

$$ e^{-\delta/2}\sum_{j=0}^\infty \frac{(\delta/2)^j}{j!}\, f_{2j+n}(x), \tag{1.31} $$

where $f_k(x)$ is the Lebesgue p.d.f. of the chi-square distribution $\chi^2_k$. It
follows from the definition of noncentral chi-square distributions that if
$Y_1,\dots,Y_k$ are independent random variables and $Y_i$ has the noncentral
chi-square distribution $\chi^2_{n_i}(\delta_i)$, $i = 1,\dots,k$, then $Y = Y_1+\cdots+Y_k$ has the
noncentral chi-square distribution $\chi^2_{n_1+\cdots+n_k}(\delta_1+\cdots+\delta_k)$.

The result for the t-distribution in Example 1.16 can be extended to the
case where $U_1$ has a nonzero expectation $\mu$ ($U_2$ still has the $\chi^2_n$ distribution
and is independent of $U_1$). The distribution of $T = U_1/\sqrt{U_2/n}$ is called
the noncentral t-distribution and denoted by $t_n(\delta)$, where $\delta = \mu$ is the
noncentrality parameter. Using the same argument as that in Example
1.15, one can show (exercise) that $T$ has the following Lebesgue p.d.f.:

$$ \frac{1}{2^{(n+1)/2}\,\Gamma(n/2)\sqrt{\pi n}} \int_0^\infty y^{(n-1)/2} e^{-[(x\sqrt{y/n}-\delta)^2+y]/2}\,dy. \tag{1.32} $$

The t-distribution $t_n$ in Example 1.16 is called a central t-distribution, since
it is a special case of the noncentral t-distribution $t_n(\delta)$ with $\delta = 0$.


Similarly, the result for the F-distribution in Example 1.16 can be
extended to the case where $X_1$ has the noncentral chi-square distribution
$\chi^2_{n_1}(\delta)$, $X_2$ has the central chi-square distribution $\chi^2_{n_2}$, and $X_1$ and $X_2$ are
independent. The distribution of $Y = (X_1/n_1)/(X_2/n_2)$ is called the
noncentral F-distribution and denoted by $F_{n_1,n_2}(\delta)$, where $\delta$ is the noncentrality
parameter. The F-distribution $F_{n_1,n_2}$ in Example 1.16 is called a central
F-distribution, since it is a special case of the noncentral F-distribution
$F_{n_1,n_2}(\delta)$ with $\delta = 0$. It can be shown (exercise) that the noncentral
F-distribution $F_{n_1,n_2}(\delta)$ has the following Lebesgue p.d.f.:

$$ e^{-\delta/2}\sum_{j=0}^\infty \frac{n_1(\delta/2)^j}{j!\,(2j+n_1)}\, f_{2j+n_1,n_2}\!\left(\frac{n_1x}{2j+n_1}\right), \tag{1.33} $$

where $f_{k_1,k_2}(x)$ is the Lebesgue p.d.f. of the central F-distribution $F_{k_1,k_2}$
given in Table 1.2.


Using some results from linear algebra, we can prove the following result 
useful in analysis of variance (Scheffé, 1959; Searle, 1971). 


Theorem 1.5 (Cochran's theorem). Suppose that $X = N_n(\mu, I_n)$ and

$$ X^\tau X = X^\tau A_1X + \cdots + X^\tau A_kX, \tag{1.34} $$

where $I_n$ is the $n\times n$ identity matrix and $A_i$ is an $n\times n$ symmetric matrix
with rank $n_i$, $i=1,\dots,k$. A necessary and sufficient condition that $X^\tau A_iX$
has the noncentral chi-square distribution $\chi^2_{n_i}(\delta_i)$, $i=1,\dots,k$, and $X^\tau A_iX$'s
are independent is $n = n_1+\cdots+n_k$, in which case $\delta_i = \mu^\tau A_i\mu$ and
$\delta_1+\cdots+\delta_k = \mu^\tau\mu$.
Proof. Suppose that $X^\tau A_iX$, $i=1,\dots,k$, are independent and $X^\tau A_iX$
has the $\chi^2_{n_i}(\delta_i)$ distribution. Then $X^\tau X$ has the $\chi^2_{n_1+\cdots+n_k}(\delta_1+\cdots+\delta_k)$
distribution. By definition, $X^\tau X$ has the noncentral chi-square distribution
$\chi^2_n(\mu^\tau\mu)$. By (1.34), $n = n_1+\cdots+n_k$ and $\delta_1+\cdots+\delta_k = \mu^\tau\mu$.

Suppose now that $n = n_1+\cdots+n_k$. From linear algebra, for each $i$
there exist $c_{ij}\in\mathcal{R}^n$, $j=1,\dots,n_i$, such that

$$ X^\tau A_iX = \pm(c_{i1}^\tau X)^2 \pm\cdots\pm (c_{in_i}^\tau X)^2. \tag{1.35} $$

Let $C_i$ be the $n\times n_i$ matrix whose $j$th column is $c_{ij}$, and $C^\tau = (C_1,\dots,C_k)$.
By (1.34) and (1.35), $X^\tau X = X^\tau C^\tau\Delta CX$ with an $n\times n$ diagonal matrix
$\Delta$ whose diagonal elements are either 1 or $-1$. This implies $C^\tau\Delta C = I_n$.
Thus, $C$ is of full rank and, hence, $\Delta = (C^\tau)^{-1}C^{-1}$, which is positive
definite. This shows $\Delta = I_n$, which implies $C^\tau C = I_n$ and

$$ X^\tau A_iX = \sum_{j=n_1+\cdots+n_{i-1}+1}^{n_1+\cdots+n_{i-1}+n_i} Y_j^2, \tag{1.36} $$

where $Y_j$ is the $j$th component of $Y = CX$. Note that $Y = N_n(C\mu, I_n)$
(Exercise 43). Hence $Y_j$'s are independent and $Y_j = N(\lambda_j,1)$, where $\lambda_j$
is the $j$th component of $C\mu$. This shows that $X^\tau A_iX$ has the $\chi^2_{n_i}(\delta_i)$
distribution with $\delta_i = \lambda^2_{n_1+\cdots+n_{i-1}+1}+\cdots+\lambda^2_{n_1+\cdots+n_{i-1}+n_i}$. Letting $X = \mu$
in (1.36) and (1.34), we obtain that $\delta_i = \mu^\tau A_i\mu$ and $\delta_1+\cdots+\delta_k =
\mu^\tau C^\tau C\mu = \mu^\tau\mu$. Finally, from (1.36) and Lemma 1.1, we conclude that
$X^\tau A_iX$, $i=1,\dots,k$, are independent.
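
A familiar instance of decomposition (1.34) takes $A_1 = J_n/n$ and $A_2 = I_n - J_n/n$, where $J_n$ denotes the $n\times n$ matrix of ones; these have ranks 1 and $n-1$, so the rank condition $n = n_1 + n_2$ of Theorem 1.5 holds. This choice of matrices, the mean vector, and the helper `quad_form` below are assumptions made for illustration. The sketch only verifies the algebraic identities $\delta_i = \mu^\tau A_i\mu$ and $\delta_1+\delta_2 = \mu^\tau\mu$; it does not simulate the distributions.

```python
def quad_form(x, a):
    """Return x^T a x for a square matrix `a` given as a list of rows."""
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


n = 4
# A1 = J/n (rank 1) and A2 = I - J/n (rank n - 1), so that A1 + A2 = I.
A1 = [[1.0 / n] * n for _ in range(n)]
A2 = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

mu = [1.0, 2.0, 0.0, -1.0]       # hypothetical mean vector
delta1 = quad_form(mu, A1)       # equals n * (average of mu)^2
delta2 = quad_form(mu, A2)       # equals sum of (mu_i - average of mu)^2
mu_sq = sum(m * m for m in mu)   # mu^T mu

print(delta1, delta2, mu_sq)
```

With this choice, $X^\tau A_1X = n\bar{X}^2$ and $X^\tau A_2X = \sum_i (X_i-\bar{X})^2$, so the theorem yields the independence of the sample mean and the sum of squared deviations for normal samples with unit variance.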




1.3.2 Moments and moment inequalities 


We have defined the expectation of a random variable in §1.2.1. It is an 
important characteristic of a random variable. In this section, we introduce 
moments, which are some other important characteristics of a random vari- 
able or vector. 


Let $X$ be a random variable. If $EX^k$ is finite, where $k$ is a positive
integer, then $EX^k$ is called the $k$th moment of $X$ or $P_X$ (the distribution
of $X$). If $E|X|^a < \infty$ for some real number $a$, then $E|X|^a$ is called the $a$th
absolute moment of $X$ or $P_X$. If $\mu = EX$ and $E(X-\mu)^k$ are finite for a
positive integer $k$, then $E(X-\mu)^k$ is called the $k$th central moment of $X$
or $P_X$. If $E|X|^a < \infty$ for an $a > 0$, then $E|X|^t < \infty$ for any positive $t < a$
and $EX^k$ is finite for any positive integer $k \le a$ (Exercise 54).


The expectation and the second central moment (if they exist) are two 
important characteristics of a random variable (or its distribution) in statis- 
tics. They are listed in Tables 1.1 and 1.2 for those useful distributions. 
The expectation, also called the mean in statistics, is a measure of the cen- 
tral location of the distribution of a random variable. The second central 
moment, also called the variance in statistics, is a measure of dispersion 
or spread of a random variable. The variance of a random variable X is 
denoted by Var(X). The variance is always nonnegative. If the variance 
of X is 0, then X is equal to its mean a.s. (Proposition 1.6). The square
root of the variance is called the standard deviation, another important
characteristic of a random variable in statistics.


The concept of mean and variance can be extended to random vectors.
The expectation of a random matrix $M$ with $(i,j)$th element $M_{ij}$ is defined
to be the matrix whose $(i,j)$th element is $EM_{ij}$. Thus, for a random
$k$-vector $X = (X_1,\dots,X_k)$, its mean is $EX = (EX_1,\dots,EX_k)$. The extension
of variance is the variance-covariance matrix of $X$ defined as

$$ \mathrm{Var}(X) = E(X-EX)(X-EX)^\tau, $$

which is a $k\times k$ symmetric matrix whose diagonal elements are variances
of $X_i$'s. The $(i,j)$th element of $\mathrm{Var}(X)$, $i\neq j$, is $E(X_i-EX_i)(X_j-EX_j)$,
which is called the covariance of $X_i$ and $X_j$ and is denoted by $\mathrm{Cov}(X_i,X_j)$.
Let $c\in\mathcal{R}^k$ and $X = (X_1,\dots,X_k)$ be a random $k$-vector. Then $Y =
c^\tau X$ is a random variable and, by Proposition 1.5 (linearity of integrals),
$EY = c^\tau EX$ if $EX$ exists. Also, when $\mathrm{Var}(X)$ is finite (i.e., all elements of
$\mathrm{Var}(X)$ are finite),

$$ \mathrm{Var}(Y) = E(c^\tau X - c^\tau EX)^2 $$
$$ \phantom{\mathrm{Var}(Y)} = E[c^\tau(X-EX)(X-EX)^\tau c] $$
$$ \phantom{\mathrm{Var}(Y)} = c^\tau[E(X-EX)(X-EX)^\tau]c $$
$$ \phantom{\mathrm{Var}(Y)} = c^\tau\,\mathrm{Var}(X)\,c. $$
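
The identity $\mathrm{Var}(c^\tau X) = c^\tau\mathrm{Var}(X)c$ can be verified exactly on a small discrete distribution. The joint distribution `dist` and the vector `c` below are hypothetical values chosen for illustration; the sketch uses exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of X = (X1, X2): {value: probability}.
dist = {(0, 0): F(1, 4), (1, 0): F(1, 4), (1, 2): F(1, 2)}
c = (3, -1)


def expect(g):
    """Expectation of g(X) under `dist`."""
    return sum(p * g(x) for x, p in dist.items())


mean = [expect(lambda x, i=i: x[i]) for i in range(2)]
cov = [[expect(lambda x, i=i, j=j: (x[i] - mean[i]) * (x[j] - mean[j]))
        for j in range(2)] for i in range(2)]

# Left side: variance of Y = c^T X computed directly.
ey = expect(lambda x: c[0] * x[0] + c[1] * x[1])
var_y = expect(lambda x: (c[0] * x[0] + c[1] * x[1] - ey) ** 2)
# Right side: the quadratic form c^T Var(X) c.
quad = sum(c[i] * cov[i][j] * c[j] for i in range(2) for j in range(2))

print(var_y, quad)
```

Both quantities equal 19/16 for this distribution.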




Since $\mathrm{Var}(Y) \ge 0$ for any $c\in\mathcal{R}^k$, the matrix $\mathrm{Var}(X)$ is nonnegative definite.
Consequently,

$$ [\mathrm{Cov}(X_i,X_j)]^2 \le \mathrm{Var}(X_i)\mathrm{Var}(X_j), \qquad i\neq j. \tag{1.37} $$

An important quantity in statistics is the correlation coefficient defined to
be $\rho_{X_i,X_j} = \mathrm{Cov}(X_i,X_j)/\sqrt{\mathrm{Var}(X_i)\mathrm{Var}(X_j)}$, which, by inequality (1.37),
is always between $-1$ and 1. It is a measure of relationship between $X_i$ and
$X_j$; if $\rho_{X_i,X_j}$ is positive (or negative), then $X_i$ and $X_j$ tend to be positively
(or negatively) related; if $\rho_{X_i,X_j} = \pm 1$, then $P(X_i = c_1 \pm c_2X_j) = 1$ with
some constants $c_1$ and $c_2 > 0$; if $\rho_{X_i,X_j} = 0$ (i.e., $\mathrm{Cov}(X_i,X_j) = 0$), then
$X_i$ and $X_j$ are said to be uncorrelated. If $X_i$ and $X_j$ are independent, then
they are uncorrelated. This follows from the following more general result.
If $X_1,\dots,X_n$ are independent random variables and $E|X_1\cdots X_n| < \infty$,
then, by Fubini's theorem and the fact that the joint c.d.f. of $(X_1,\dots,X_n)$
corresponds to a product measure, we obtain that

$$ E(X_1\cdots X_n) = EX_1\cdots EX_n. \tag{1.38} $$

In fact, pairwise independence of $X_1,\dots,X_n$ implies that $X_i$'s are
uncorrelated, since $\mathrm{Cov}(X_i,X_j)$ involves only a pair of random variables. However,
the converse is not necessarily true: uncorrelated random variables may not
be pairwise independent. Examples can be found in Exercises 60-61.
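
A standard illustration of uncorrelated but dependent variables (not necessarily the one in the cited exercises) takes $X$ uniform on $\{-1,0,1\}$ and $Y = X^2$. The sketch below, with hypothetical variable names, computes the covariance exactly and checks one product rule that independence would require.

```python
from fractions import Fraction as F

# X uniform on {-1, 0, 1} and Y = X^2 (a hypothetical illustration).
support = [(-1, 1), (0, 0), (1, 1)]   # the three possible (x, y) pairs
p = F(1, 3)                           # probability of each pair

ex = sum(p * x for x, _ in support)
ey = sum(p * y for _, y in support)
cov_xy = sum(p * (x - ex) * (y - ey) for x, y in support)

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint = sum(p for x, y in support if x == 0 and y == 0)
p_x0 = sum(p for x, _ in support if x == 0)
p_y0 = sum(p for _, y in support if y == 0)

print(cov_xy, p_joint, p_x0 * p_y0)
```

The covariance is 0, yet $P(X=0,Y=0) = 1/3$ differs from $P(X=0)P(Y=0) = 1/9$, so $X$ and $Y$ are not independent.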


Let $\mathcal{R}_M = \{y\in\mathcal{R}^k : y = Mx \text{ with some } x\in\mathcal{R}^k\}$ for any $k\times k$
symmetric matrix $M$. If a random $k$-vector $X$ has a finite $\mathrm{Var}(X)$, then
$P(X-EX\in\mathcal{R}_{\mathrm{Var}(X)}) = 1$. This means that if the rank of $\mathrm{Var}(X)$ is $r < k$,
then $X$ is in a subspace of $\mathcal{R}^k$ with dimension $r$. Consequently, if $P_X \ll$
Lebesgue measure on $\mathcal{R}^k$, then the rank of $\mathrm{Var}(X)$ is $k$.


Example 1.17. Let $X$ be a random $k$-vector having the $N_k(\mu,\Sigma)$
distribution. It can be shown (exercise) that $EX = \mu$ and $\mathrm{Var}(X) = \Sigma$. Thus, $\mu$
and $\Sigma$ in (1.24) are the mean vector and the variance-covariance matrix of
$X$. If $\Sigma$ is a diagonal matrix (i.e., all components of $X$ are uncorrelated),
then by (1.25), the components of $X$ are independent. This shows an
important property of random variables having a joint normal distribution: they are
independent if and only if they are uncorrelated.


There are many useful inequalities related to moments. The inequality
in (1.37) is in fact the well-known Cauchy-Schwarz inequality whose
general form is

$$ [E(XY)]^2 \le EX^2\,EY^2, \tag{1.39} $$

where $X$ and $Y$ are random variables with a well-defined $E(XY)$.
Inequality (1.39) is a special case of the following Hölder's inequality:

$$ E|XY| \le (E|X|^p)^{1/p}(E|Y|^q)^{1/q}, \tag{1.40} $$

where $p$ and $q$ are constants satisfying $p > 1$ and $p^{-1}+q^{-1} = 1$. To show
inequality (1.40), we use the following inequality (Exercise 62):

$$ x^ty^{1-t} \le tx + (1-t)y, \tag{1.41} $$

where $x$ and $y$ are nonnegative real numbers and $t\in(0,1)$. If either $E|X|^p$
or $E|Y|^q$ is $\infty$, then (1.40) holds. Hence we can assume that both $E|X|^p$
and $E|Y|^q$ are finite. Let $a = (E|X|^p)^{1/p}$ and $b = (E|Y|^q)^{1/q}$. If either
$a = 0$ or $b = 0$, then the equality in (1.40) holds because of Proposition
1.6(ii). Assume now $a\neq 0$ and $b\neq 0$. Letting $x = |X/a|^p$, $y = |Y/b|^q$, and
$t = p^{-1}$ in (1.41), we obtain that

$$ \left|\frac{XY}{ab}\right| \le \frac{|X|^p}{pa^p} + \frac{|Y|^q}{qb^q}. $$

Taking expectations on both sides of this expression, we obtain that

$$ \frac{E|XY|}{ab} \le \frac{E|X|^p}{pa^p} + \frac{E|Y|^q}{qb^q} = \frac{1}{p}+\frac{1}{q} = 1, $$

which is (1.40). In fact, the equality in (1.40) holds if and only if $\alpha|X|^p =
\beta|Y|^q$ a.s. for some nonzero constants $\alpha$ and $\beta$ (Exercise 62).


Using Hölder's inequality, we can prove Liapounov's inequality

$$ (E|X|^r)^{1/r} \le (E|X|^s)^{1/s}, \tag{1.42} $$

where $r$ and $s$ are constants satisfying $1\le r\le s$, and Minkowski's inequality

$$ (E|X+Y|^p)^{1/p} \le (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}, \tag{1.43} $$

where $X$ and $Y$ are random variables and $p$ is a constant larger than or
equal to 1 (Exercise 63).


Minkowski's inequality can be extended to the case of more than two
random variables (Exercise 63). The following inequality is a tightened
form of Minkowski's inequality due to Esseen and von Bahr (1965). Let
$X_1,\dots,X_n$ be independent random variables with mean 0 and $E|X_i|^p < \infty$,
$i=1,\dots,n$, where $p$ is a constant in $[1,2]$. Then

$$ E\left|\sum_{i=1}^n X_i\right|^p \le C_p\sum_{i=1}^n E|X_i|^p, \tag{1.44} $$

where $C_p$ is a constant depending only on $p$. When $1 < p < 2$, inequality
(1.44) can be proved (Exercise 63) using inequality

$$ |a+b|^p \le |a|^p + p\,\mathrm{sgn}(a)|a|^{p-1}b + C_p|b|^p, \qquad a\in\mathcal{R},\ b\in\mathcal{R}, $$

where $\mathrm{sgn}(x)$ is 1 or $-1$ as $x$ is positive or negative and

$$ C_p = \sup_{x\in\mathcal{R},\,x\neq 0}\ (|1+x|^p - 1 - px)/|x|^p. $$

For $p\ge 2$, there is a similar inequality due to Marcinkiewicz and Zygmund:

$$ E\left|\sum_{i=1}^n X_i\right|^p \le \frac{C_p}{n^{1-p/2}}\sum_{i=1}^n E|X_i|^p, \tag{1.45} $$

where $C_p$ is a constant depending only on $p$. A proof of inequality (1.45)
can be found in Loève (1977, p. 276).

Recall from calculus that a subset $A$ of $\mathcal{R}^k$ is convex if and only if $x\in A$
and $y\in A$ imply $tx+(1-t)y\in A$ for any $t\in[0,1]$; a function $f$ from a
convex $A\subset\mathcal{R}^k$ to $\mathcal{R}$ is convex if and only if

$$ f(tx+(1-t)y) \le tf(x)+(1-t)f(y), \qquad x\in A,\ y\in A,\ t\in[0,1]; \tag{1.46} $$

and $f$ is strictly convex if and only if (1.46) holds with $\le$ replaced by the
strict inequality $<$ whenever $x\neq y$ and $t\in(0,1)$. If $f$ is twice differentiable
on an open convex $A$, then a necessary and sufficient condition for $f$ to be convex
is that the $k\times k$ second-order partial derivative matrix $\partial^2f/\partial x\partial x^\tau$, the
so-called Hessian matrix, is nonnegative definite; if the Hessian matrix is
positive definite, then $f$ is strictly convex. For a convex function
$f$ defined on an open convex $A\subset\mathcal{R}^k$ and a random $k$-vector $X$ with finite
mean and $P(X\in A) = 1$, a very useful inequality in probability theory and
statistics is the following Jensen's inequality:

$$ f(EX) \le Ef(X). \tag{1.47} $$


If $f$ is strictly convex, then $\le$ in (1.47) can be replaced by $<$ unless
$P(f(X) = Ef(X)) = 1$. To prove (1.47), we use without proof the
following fact for convex $f$ on an open convex $A\subset\mathcal{R}^k$ (see, e.g., Lehmann,
1983, p. 53). For any $y\in A$, there exists a vector $c_y\in\mathcal{R}^k$ such that

$$ f(x) \ge f(y) + c_y^\tau(x-y), \qquad x\in A. \tag{1.48} $$

We also use the fact that $EX\in A$ (see, e.g., Ferguson, 1967, p. 74). Letting
$x = X$ and $y = EX$, we obtain (1.47) by taking expectations on both sides
of (1.48). If $f$ is strictly convex, then (1.48) holds with $\ge$ replaced by $>$
for $x\neq y$. By Proposition 1.6(ii), $Ef(X) > f(EX)$ unless $P(f(X) = Ef(X)) = 1$.


Example 1.18. A direct application of Jensen's inequality (1.47) is that
if $X$ is a nonconstant positive random variable with finite mean, then

$$ (EX)^{-1} < E(X^{-1}) \quad\text{and}\quad E(\log X) < \log(EX), $$

since $t^{-1}$ and $-\log t$ are convex functions on $(0,\infty)$. Another application
is to prove the following inequality related to entropy. Let $f$ and $g$ be
positive integrable functions on a measure space with a $\sigma$-finite measure $\nu$.
If $\int f\,d\nu \ge \int g\,d\nu > 0$, then one can show (exercise) that

$$ \int f\log\left(\frac{f}{g}\right)d\nu \ge 0. \tag{1.49} $$
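
The two inequalities at the start of Example 1.18 can be illustrated with a small discrete distribution. The values and probabilities below are hypothetical; the sketch evaluates both sides of each inequality.

```python
import math

# Hypothetical positive, nonconstant X: values and their probabilities.
values = [0.5, 1.0, 4.0]
probs = [0.2, 0.5, 0.3]


def expect(g):
    """Expectation of g(X) for the discrete distribution above."""
    return sum(p * g(x) for x, p in zip(values, probs))


mean = expect(lambda x: x)
inv_of_mean, mean_of_inv = 1.0 / mean, expect(lambda x: 1.0 / x)
mean_of_log, log_of_mean = expect(math.log), math.log(mean)

print(inv_of_mean, mean_of_inv)   # (EX)^{-1} versus E(X^{-1})
print(mean_of_log, log_of_mean)   # E(log X) versus log(EX)
```

In each printed pair the first number should be strictly smaller than the second, as Jensen's inequality predicts for a nonconstant $X$.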


The next inequality, Chebyshev's inequality, is almost trivial but very
useful and famous. Let $X$ be a random variable and $\varphi$ a nonnegative and
nondecreasing function on $[0,\infty)$ satisfying $\varphi(-t) = \varphi(t)$. Then, for each
constant $t \ge 0$,

$$ \varphi(t)P(|X|\ge t) \le \int_{\{|X|\ge t\}}\varphi(X)\,dP \le E\varphi(X), \tag{1.50} $$

where both inequalities in (1.50) follow from Proposition 1.6(i) and the first
inequality also uses the fact that on the set $\{|X|\ge t\}$, $\varphi(X)\ge\varphi(t)$. The
most familiar application of (1.50) is when $\varphi(t) = |t|^p$ for $p\in(0,\infty)$, in
which case inequality (1.50) is also called Markov's inequality. Chebyshev's
inequality, sometimes together with one of the moment inequalities
introduced in this section, can be used to yield a desired upper bound for the
"tail" probability $P(|X|\ge t)$. For example, let $Y$ be a random variable
with mean $\mu$ and variance $\sigma^2$. Then $X = (Y-\mu)/\sigma$ has mean 0 and
variance 1 and, by (1.50) with $\varphi(t) = t^2$, $P(|X|\ge 2)\le\frac{1}{4}$. This means that
the probability that the random variable $|Y-\mu|$ exceeds twice its standard
deviation is bounded by $\frac{1}{4}$. Similarly, we can also claim that the probability
of $|Y-\mu|$ exceeding $3\sigma$ is bounded by $\frac{1}{9}$. These bounds are rough but
they can be applied to any random variable with a finite variance. Other
applications of Chebyshev's inequality can be found in §1.5.
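
To see how rough these bounds can be, one can compare them with the exact tail of a standard normal variable, and with a three-point distribution for which the bound at $t = 2$ is attained. Both comparison distributions and all names in the sketch below are assumptions chosen for illustration.

```python
import math


def chebyshev_bound(t):
    """Bound from (1.50) with phi(t) = t^2 for mean 0 and variance 1."""
    return 1.0 / (t * t)


def normal_two_sided_tail(t):
    """Exact P(|X| >= t) when X is N(0, 1)."""
    return math.erfc(t / math.sqrt(2.0))


# Hypothetical three-point distribution with mean 0 and variance 1.
sharp = {-2.0: 0.125, 0.0: 0.75, 2.0: 0.125}
sharp_var = sum(p * x * x for x, p in sharp.items())
sharp_tail = sum(p for x, p in sharp.items() if abs(x) >= 2.0)

for t in (2.0, 3.0):
    print(t, normal_two_sided_tail(t), chebyshev_bound(t))
print(sharp_var, sharp_tail)
```

For the normal distribution the exact tail probabilities are far below 1/4 and 1/9, while the three-point distribution shows that the bound 1/4 cannot be improved without further assumptions.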

In some cases, we need an improvement over inequality (1.50) when
$X$ is of some special form. Let $Y_1,\dots,Y_n$ be independent random variables
having finite variances. The following inequality is due to Hájek and Rényi:

$$ P\left(\max_{1\le l\le n} c_l\left|\sum_{i=1}^l (Y_i-EY_i)\right| > t\right) \le \frac{1}{t^2}\sum_{i=1}^n c_i^2\,\mathrm{Var}(Y_i), \qquad t > 0, \tag{1.51} $$

where $c_i$'s are positive constants satisfying $c_1\ge c_2\ge\cdots\ge c_n$. If $c_i = 1$ for
all $i$, then inequality (1.51) reduces to the famous Kolmogorov's inequality.
A proof for (1.51) is given in Sen and Singer (1993, pp. 65-66).


1.3.3 Moment generating and characteristic functions 


Moments are important characteristics of a distribution, but they do not 
determine a distribution in the sense that two different distributions may 




have the same moments of all orders. Functions that determine a distribu- 
tion are introduced in the following definition. 


Definition 1.5. Let $X$ be a random $k$-vector.
(i) The moment generating function (m.g.f.) of $X$ or $P_X$ is defined as

$$ \psi_X(t) = Ee^{t^\tau X}, \qquad t\in\mathcal{R}^k. $$

(ii) The characteristic function (ch.f.) of $X$ or $P_X$ is defined as

$$ \phi_X(t) = Ee^{\sqrt{-1}\,t^\tau X} = E[\cos(t^\tau X)] + \sqrt{-1}\,E[\sin(t^\tau X)], \qquad t\in\mathcal{R}^k. $$


Obviously $\psi_X(0) = \phi_X(0) = 1$ for any random vector $X$. The ch.f. is
complex-valued and always well defined. In fact, any ch.f. is bounded by
1 and is a uniformly continuous function on $\mathcal{R}^k$ (exercise). The m.g.f. is
nonnegative but may be $\infty$ everywhere except at $t = 0$ (Example 1.19). If
the m.g.f. is finite in a neighborhood of $0\in\mathcal{R}^k$, then $\phi_X(t)$ can be obtained
by replacing $t$ in $\psi_X(t)$ by $\sqrt{-1}\,t$. Tables 1.1 and 1.2 contain the m.g.f. (or
ch.f. when the m.g.f. is $\infty$ everywhere except at 0) for distributions useful
in statistics. For a linear transformation $Y = A^\tau X + c$, where $A$ is a $k\times m$
matrix and $c\in\mathcal{R}^m$, it follows from Definition 1.5 that

$$ \psi_Y(u) = e^{c^\tau u}\psi_X(Au) \quad\text{and}\quad \phi_Y(u) = e^{\sqrt{-1}\,c^\tau u}\phi_X(Au), \qquad u\in\mathcal{R}^m. \tag{1.52} $$
For a random variable $X$, if its m.g.f. is finite at $t$ and $-t$ for a $t\neq 0$,
then $X$ has finite moments and absolute moments of any order. To compute
moments of $X$ using its m.g.f., a condition stronger than the finiteness of
the m.g.f. at some $t\neq 0$ is needed. Consider a random $k$-vector $X$. If $\psi_X$
is finite in a neighborhood of 0, then $\mu_{r_1,\dots,r_k} = E(X_1^{r_1}\cdots X_k^{r_k})$ is finite for
any nonnegative integers $r_1,\dots,r_k$, where $X_j$ is the $j$th component of $X$,
and $\psi_X$ has the power series expansion

$$ \psi_X(t) = \sum_{(r_1,\dots,r_k)\in Z}\frac{\mu_{r_1,\dots,r_k}\,t_1^{r_1}\cdots t_k^{r_k}}{r_1!\cdots r_k!} \tag{1.53} $$

for $t$ in the neighborhood of 0, where $t_j$ is the $j$th component of $t$ and
$Z\subset\mathcal{R}^k$ is the set of vectors whose components are nonnegative integers.
Consequently, the components of $X$ have finite moments of all orders and

$$ E(X_1^{r_1}\cdots X_k^{r_k}) = \left.\frac{\partial^{r_1+\cdots+r_k}\psi_X(t)}{\partial t_1^{r_1}\cdots\partial t_k^{r_k}}\right|_{t=0}, $$

which are also called moments of $X$. In particular,

$$ \left.\frac{\partial\psi_X(t)}{\partial t}\right|_{t=0} = EX, \qquad \left.\frac{\partial^2\psi_X(t)}{\partial t\,\partial t^\tau}\right|_{t=0} = E(XX^\tau), \tag{1.54} $$

and, when $k = 1$ and $p$ is a positive integer, $\psi_X^{(p)}(0) = EX^p$, where $g^{(p)}(t)$
denotes the $p$th order derivative of a function $g(t)$.

If $0 < \psi_X(t) < \infty$, then $\kappa_X(t) = \log\psi_X(t)$ is called the cumulant
generating function of $X$ or $P_X$. If $0 < \psi_X(t) < \infty$ for $t$ in a neighborhood
of 0, then $\kappa_X$ has a power series expansion similar to that in (1.53):

$$ \kappa_X(t) = \sum_{(r_1,\dots,r_k)\in Z}\frac{\kappa_{r_1,\dots,r_k}\,t_1^{r_1}\cdots t_k^{r_k}}{r_1!\cdots r_k!}, \tag{1.55} $$

where $\kappa_{r_1,\dots,r_k}$'s are called cumulants of $X$. There is a one-to-one
correspondence between the set of moments and the set of cumulants. An example
for the case of $k = 1$ is given in Exercise 68.

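When $\psi_X$ is not finite, finite moments of $X$ can be obtained by
differentiating its ch.f. $\phi_X$. Suppose that $E|X_1^{r_1}\cdots X_k^{r_k}| < \infty$ for some nonnegative
integers $r_1,\dots,r_k$. Let $r = r_1+\cdots+r_k$ and

$$ g(t) = \frac{\partial^r e^{\sqrt{-1}\,t^\tau X}}{\partial t_1^{r_1}\cdots\partial t_k^{r_k}} = (-1)^{r/2}X_1^{r_1}\cdots X_k^{r_k}\,e^{\sqrt{-1}\,t^\tau X}. $$

Then $|g(t)|\le|X_1^{r_1}\cdots X_k^{r_k}|$, which is integrable. Hence, from Example 1.8,

$$ \frac{\partial^r\phi_X(t)}{\partial t_1^{r_1}\cdots\partial t_k^{r_k}} = (-1)^{r/2}E\left(X_1^{r_1}\cdots X_k^{r_k}\,e^{\sqrt{-1}\,t^\tau X}\right) \tag{1.56} $$

and

$$ \left.\frac{\partial^r\phi_X(t)}{\partial t_1^{r_1}\cdots\partial t_k^{r_k}}\right|_{t=0} = (-1)^{r/2}E(X_1^{r_1}\cdots X_k^{r_k}). $$

In particular,

$$ \left.\frac{\partial\phi_X(t)}{\partial t}\right|_{t=0} = \sqrt{-1}\,EX, \qquad \left.\frac{\partial^2\phi_X(t)}{\partial t\,\partial t^\tau}\right|_{t=0} = -E(XX^\tau), $$

and, if $k = 1$ and $p$ is a positive integer, then $\phi_X^{(p)}(0) = (-1)^{p/2}EX^p$,
provided that all moments involved are finite. In fact, when $k = 1$, if $\phi_X$
has a finite derivative of even order $p$ at $t = 0$, then $EX^p < \infty$ (see, e.g.,
Chung, 1974, pp. 166-168).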


Example 1.19. Let $X = N(\mu,\sigma^2)$. From Table 1.2, $\psi_X(t) = e^{\mu t+\sigma^2t^2/2}$. A
direct calculation shows that $EX = \psi_X'(0) = \mu$, $EX^2 = \psi_X''(0) = \sigma^2+\mu^2$,
$EX^3 = \psi_X^{(3)}(0) = 3\sigma^2\mu+\mu^3$, and $EX^4 = \psi_X^{(4)}(0) = 3\sigma^4+6\sigma^2\mu^2+\mu^4$. If $\mu =
0$, then $EX^p = 0$ when $p$ is an odd integer and $EX^p = (p-1)(p-3)\cdots 3\cdot 1\,\sigma^p$
when $p$ is an even integer (exercise). The cumulant generating function of
$X$ is $\kappa_X(t) = \log\psi_X(t) = \mu t+\sigma^2t^2/2$. Hence, $\kappa_1 = \mu$, $\kappa_2 = \sigma^2$, and $\kappa_r = 0$
for $r = 3,4,\dots$.
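
The first two moments above can be recovered numerically by differentiating the m.g.f. at 0. The sketch below uses central finite differences, which is only an approximation of the derivatives in (1.54); the parameter values `MU` and `SIGMA`, the step size, and the function names are assumptions for illustration.

```python
import math

MU, SIGMA = 1.0, 2.0  # hypothetical parameter values


def psi(t):
    """m.g.f. of N(MU, SIGMA^2) as given in Table 1.2."""
    return math.exp(MU * t + 0.5 * SIGMA ** 2 * t ** 2)


def first_two_moments(h=1e-4):
    """Central-difference approximations of psi'(0) and psi''(0)."""
    d1 = (psi(h) - psi(-h)) / (2.0 * h)
    d2 = (psi(h) - 2.0 * psi(0.0) + psi(-h)) / (h * h)
    return d1, d2


m1, m2 = first_two_moments()
print(m1, MU)                     # approximates EX = mu
print(m2, SIGMA ** 2 + MU ** 2)   # approximates EX^2 = sigma^2 + mu^2
```

Higher moments could be approximated the same way, although finite differences become numerically less stable as the order of the derivative grows.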


We now find a random variable having finite moments of all order but
having an m.g.f. $= \infty$ except for $t = 0$. Let $P_n$ be the probability
measure for the $N(0,\sigma_n^2)$ distribution, $n = 1,2,\dots$. Then $P = \sum_{n=1}^\infty 2^{-n}P_n$
is a probability measure (Exercise 35). Let $X$ be a random variable
having distribution $P$. Since the m.g.f. of $N(0,\sigma_n^2)$ is $e^{\sigma_n^2t^2/2}$, it follows from
Fubini's theorem that the m.g.f. of $X$ is $\psi_X(t) = \sum_{n=1}^\infty 2^{-n}e^{\sigma_n^2t^2/2}$. When
$\sigma_n^2 = n^2$, $\psi_X(t) = \infty$ for any $t\neq 0$ but $EX^k = 0$ for any odd integer $k$ and
$EX^k = \sum_{n=1}^\infty 2^{-n}(k-1)(k-3)\cdots 1\,n^k < \infty$ for any even integer $k$. When
$\sigma_n^2 = n$, $\psi_X(t) = (2e^{-t^2/2}-1)^{-1}$ for $|t| < \sqrt{\log 4}$ and, hence, the moments
of $X$ can be obtained by differentiating $\psi_X$. For example, $EX = \psi_X'(0) = 0$
and $EX^2 = \psi_X''(0) = 2$.


A fundamental fact about ch.f.'s is that there is a one-to-one
correspondence between the set of all distributions on $\mathcal{R}^k$ and the set of all ch.f.'s
defined on $\mathcal{R}^k$. The same fact is true for m.g.f.'s, but we have to focus on
distributions having m.g.f.'s finite in neighborhoods of 0.


Theorem 1.6 (Uniqueness). Let $X$ and $Y$ be random $k$-vectors.
(i) If $\phi_X(t) = \phi_Y(t)$ for all $t\in\mathcal{R}^k$, then $P_X = P_Y$.
(ii) If $\psi_X(t) = \psi_Y(t) < \infty$ for all $t$ in a neighborhood of 0, then $P_X = P_Y$.
Proof. (i) The result follows from the following inversion formula whose
proof can be found, for example, in Billingsley (1986, p. 395): for any $a =
(a_1,\dots,a_k)\in\mathcal{R}^k$, $b = (b_1,\dots,b_k)\in\mathcal{R}^k$, and $(a,b] = (a_1,b_1]\times\cdots\times(a_k,b_k]$
satisfying $P_X(\text{the boundary of } (a,b]) = 0$,

$$ P_X((a,b]) = \lim_{c\to\infty}\int_{-c}^{c}\cdots\int_{-c}^{c}\frac{\phi_X(t_1,\dots,t_k)}{(-1)^{k/2}(2\pi)^k}\prod_{i=1}^k\frac{e^{-\sqrt{-1}\,t_ia_i}-e^{-\sqrt{-1}\,t_ib_i}}{t_i}\,dt_i. $$

(ii) First consider the case of $k = 1$. From $e^{s|x|}\le e^{sx}+e^{-sx}$, we
conclude that $|X|$ has an m.g.f. that is finite in the neighborhood $(-c,c)$ for
some $c > 0$ and $|X|$ has finite moments of all order. Using the inequality
$|e^{\sqrt{-1}\,tx}[e^{\sqrt{-1}\,ax}-\sum_{j=0}^n(\sqrt{-1}\,ax)^j/j!]|\le|ax|^{n+1}/(n+1)!$, we obtain that

$$ \left|\phi_X(t+a)-\sum_{j=0}^n\frac{a^j}{j!}E\left[(\sqrt{-1}\,X)^je^{\sqrt{-1}\,tX}\right]\right| \le \frac{|a|^{n+1}E|X|^{n+1}}{(n+1)!}, $$

which together with (1.53) and (1.56) imply that, for any $t\in\mathcal{R}$,

$$ \phi_X(t+a) = \sum_{j=0}^\infty\frac{\phi_X^{(j)}(t)}{j!}\,a^j, \qquad |a| < c. \tag{1.57} $$

Similarly, (1.57) holds with $\phi_X$ replaced by $\phi_Y$. Under the assumption that
$\psi_X = \psi_Y < \infty$ in a neighborhood of 0, $X$ and $Y$ have the same moments of
all order. By (1.56), $\phi_X^{(j)}(0) = \phi_Y^{(j)}(0)$ for all $j = 1,2,\dots$, which and (1.57)
with $t = 0$ imply that $\phi_X$ and $\phi_Y$ are the same on the interval $(-c,c)$ and
hence have identical derivatives there. Considering $t = c-\epsilon$ and $-c+\epsilon$ for
an arbitrarily small $\epsilon > 0$ in (1.57) shows that $\phi_X$ and $\phi_Y$ also agree on
$(-2c+\epsilon, 2c-\epsilon)$ and hence on $(-2c,2c)$. By the same argument $\phi_X$ and $\phi_Y$
are the same on $(-3c,3c)$ and so on. Hence, $\phi_X(t) = \phi_Y(t)$ for all $t$ and,
by part (i), $P_X = P_Y$.

Consider now the general case of $k\ge 2$. If $P_X\neq P_Y$, then by part (i)
there exists $t\in\mathcal{R}^k$ such that $\phi_X(t)\neq\phi_Y(t)$. Then $\phi_{t^\tau X}(1)\neq\phi_{t^\tau Y}(1)$,
which implies that $P_{t^\tau X}\neq P_{t^\tau Y}$. But $\psi_X = \psi_Y < \infty$ in a neighborhood of
$0\in\mathcal{R}^k$ implies that $\psi_{t^\tau X} = \psi_{t^\tau Y} < \infty$ in a neighborhood of $0\in\mathcal{R}$ and, by
the proved result for $k = 1$, $P_{t^\tau X} = P_{t^\tau Y}$. This contradiction shows that
$P_X = P_Y$.


Applying result (1.38) and Lemma 1.1, we obtain that 


$$ \psi_{X+Y}(t) = \psi_X(t)\psi_Y(t) \quad\text{and}\quad \phi_{X+Y}(t) = \phi_X(t)\phi_Y(t), \qquad t\in\mathcal{R}^k, \tag{1.58} $$


for independent random k-vectors X and Y. This result, together with 
Theorem 1.6, provides a useful tool to obtain distributions of sums of inde- 
pendent random vectors with known distributions. The following example 
is an illustration. 


Example 1.20. Let $X_i$, $i=1,\dots,k$, be independent random variables and
$X_i$ have the gamma distribution $\Gamma(\alpha_i,\gamma)$ (Table 1.2), $i = 1,\dots,k$. From
Table 1.2, $X_i$ has the m.g.f. $\psi_{X_i}(t) = (1-\gamma t)^{-\alpha_i}$, $t < \gamma^{-1}$, $i = 1,\dots,k$.
By result (1.58), the m.g.f. of $Y = X_1+\cdots+X_k$ is equal to $\psi_Y(t) =
(1-\gamma t)^{-(\alpha_1+\cdots+\alpha_k)}$, $t < \gamma^{-1}$. From Table 1.2, the gamma distribution
$\Gamma(\alpha_1+\cdots+\alpha_k,\gamma)$ has the m.g.f. $\psi_Y(t)$ and, hence, is the distribution of
$Y$ (by Theorem 1.6).
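
The conclusion of Example 1.20 can be cross-checked without m.g.f.'s by convolving two gamma densities numerically with formula (1.28). The sketch below assumes the parametrization in which $\gamma$ is a scale parameter, i.e., the p.d.f. is $x^{\alpha-1}e^{-x/\gamma}/[\Gamma(\alpha)\gamma^\alpha]$ on $(0,\infty)$, which is consistent with the m.g.f. $(1-\gamma t)^{-\alpha}$; the parameter values and function names are hypothetical.

```python
import math


def gamma_pdf(x, alpha, gamma):
    """Lebesgue p.d.f. of Gamma(alpha, gamma) with scale parameter gamma."""
    if x <= 0:
        return 0.0
    return (x ** (alpha - 1) * math.exp(-x / gamma)
            / (math.gamma(alpha) * gamma ** alpha))


def convolved_pdf(y, a1, a2, gamma, steps=20000):
    """Midpoint-rule approximation of (1.28) for Gamma(a1) + Gamma(a2)."""
    h = y / steps
    return h * sum(
        gamma_pdf((i + 0.5) * h, a1, gamma)
        * gamma_pdf(y - (i + 0.5) * h, a2, gamma)
        for i in range(steps)
    )


y, a1, a2, g = 4.0, 2.0, 3.0, 1.5   # hypothetical evaluation point and parameters
print(convolved_pdf(y, a1, a2, g), gamma_pdf(y, a1 + a2, g))
```

The two printed values should agree up to quadrature error, in line with $Y$ having the $\Gamma(\alpha_1+\alpha_2,\gamma)$ distribution.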


Similarly, result (1.52) and Theorem 1.6 can be used to determine dis- 
tributions of linear transformations of random vectors with known distri- 
butions. The following is another interesting application of Theorem 1.6. 
Note that a random variable X is symmetric about 0 (defined according 
to (1.30)) if and only if X and —X have the same distribution, which can 
then be used as the definition of a random vector X symmetric about 0. 
We now show that $X$ is symmetric about 0 if and only if its ch.f. $\phi_X$ is
real-valued. If $X$ and $-X$ have the same distribution, then by Theorem 1.6,
$\phi_X(t) = \phi_{-X}(t)$. From (1.52), $\phi_{-X}(t) = \phi_X(-t)$. Then $\phi_X(t) = \phi_X(-t)$.
Since $\sin(-t^\tau X) = -\sin(t^\tau X)$ and $\cos(t^\tau X) = \cos(-t^\tau X)$, this proves
$E[\sin(t^\tau X)] = 0$ and, thus, $\phi_X$ is real-valued. Conversely, if $\phi_X$ is
real-valued, then $\phi_X(t) = E[\cos(t^\tau X)]$ and $\phi_{-X}(t) = \phi_X(-t) = \phi_X(t)$. By
Theorem 1.6, $X$ and $-X$ must have the same distribution.
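
The equivalence between symmetry about 0 and a real-valued ch.f. is easy to observe for discrete distributions. In the sketch below, the two distributions and the helper `chf` are hypothetical illustrations; `chf` evaluates $Ee^{\sqrt{-1}\,tX}$ directly from Definition 1.5 with $k = 1$.

```python
import cmath


def chf(dist, t):
    """ch.f. E exp(sqrt(-1) t X) of a discrete distribution {value: prob}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())


symmetric = {-2.0: 0.25, -1.0: 0.25, 1.0: 0.25, 2.0: 0.25}  # X and -X agree
skewed = {-1.0: 0.5, 0.0: 0.25, 3.0: 0.25}                   # X and -X differ

for t in (0.3, 1.0, 2.7):
    print(t, chf(symmetric, t), chf(skewed, t))
```

For the symmetric distribution the imaginary part vanishes up to rounding at every $t$, whereas the skewed distribution has a nonzero imaginary part at, for example, $t = 1$.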

Other applications of ch.f.’s can be found in §1.5. 




1.4 Conditional Expectations 


In elementary probability the conditional probability of an event $B$ given
an event $A$ is defined as $P(B|A) = P(A\cap B)/P(A)$, provided that $P(A) >
0$. In probability and statistics, however, we sometimes need a notion of
"conditional probability" even for $A$'s with $P(A) = 0$; for example, $A =
\{Y = c\}$, where $c\in\mathcal{R}$ and $Y$ is a random variable having a continuous c.d.f.
General definitions of conditional probability, expectation, and distribution 
are introduced in this section, and they are shown to agree with those 
defined in elementary probability in special cases. 


1.4.1 Conditional expectations 


Definition 1.6. Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$.
(i) Let $\mathcal{A}$ be a sub-$\sigma$-field of $\mathcal{F}$. The conditional expectation of $X$ given
$\mathcal{A}$, denoted by $E(X|\mathcal{A})$, is the a.s.-unique random variable satisfying the
following two conditions:

(a) $E(X|\mathcal{A})$ is measurable from $(\Omega, \mathcal{A})$ to $(\mathcal{R}, \mathcal{B})$;

(b) $\int_A E(X|\mathcal{A})dP = \int_A XdP$ for any $A \in \mathcal{A}$.
(Note that the existence of $E(X|\mathcal{A})$ follows from Theorem 1.4.)
(ii) Let $B \in \mathcal{F}$. The conditional probability of $B$ given $\mathcal{A}$ is defined to be
$P(B|\mathcal{A}) = E(I_B|\mathcal{A})$.
(iii) Let $Y$ be measurable from $(\Omega, \mathcal{F}, P)$ to $(\Lambda, \mathcal{G})$. The conditional expectation
of $X$ given $Y$ is defined to be $E(X|Y) = E[X|\sigma(Y)]$. ∎


Essentially, the $\sigma$-field $\sigma(Y)$ contains “the information in $Y$”. Hence,
$E(X|Y)$ is the “expectation” of $X$ given the information provided by $\sigma(Y)$.
The following useful result shows that there is a Borel function $h$ defined
on the range of $Y$ such that $E(X|Y) = h \circ Y$.


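Lemma 1.2. Let $Y$ be measurable from $(\Omega, \mathcal{F})$ to $(\Lambda, \mathcal{G})$ and $Z$ a function
from $(\Omega, \mathcal{F})$ to $\mathcal{R}^k$. Then $Z$ is measurable from $(\Omega, \sigma(Y))$ to $(\mathcal{R}^k, \mathcal{B}^k)$ if
and only if there is a measurable function $h$ from $(\Lambda, \mathcal{G})$ to $(\mathcal{R}^k, \mathcal{B}^k)$ such
that $Z = h \circ Y$. ∎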


The function $h$ in $E(X|Y) = h \circ Y$ is a Borel function on $(\Lambda, \mathcal{G})$. Let
$y \in \Lambda$. We define
$$E(X|Y = y) = h(y)$$
to be the conditional expectation of $X$ given $Y = y$. Note that $h(y)$ is a
function on $\Lambda$, whereas $h \circ Y = E(X|Y)$ is a function on $\Omega$.


For a random vector X, E(X|A) is defined as the vector of conditional 
expectations of components of X. 




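Example 1.21. Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$,
$A_1, A_2, ...$ be disjoint events on $(\Omega, \mathcal{F}, P)$ such that $\cup A_i = \Omega$ and $P(A_i) > 0$
for all $i$, and let $a_1, a_2, ...$ be distinct real numbers. Define
$Y = a_1 I_{A_1} + a_2 I_{A_2} + \cdots$. We now show that

$$E(X|Y) = \sum_{i=1}^{\infty} \frac{\int_{A_i} XdP}{P(A_i)} I_{A_i}. \qquad (1.59)$$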


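We need to verify (a) and (b) in Definition 1.6 with $\mathcal{A} = \sigma(Y)$. Since
$\sigma(Y) = \sigma(\{A_1, A_2, ...\})$, it is clear that the function on the right-hand side
of (1.59) is measurable on $(\Omega, \sigma(Y))$. For any $B \in \mathcal{B}$, $Y^{-1}(B) = \cup_{i: a_i \in B} A_i$.
Using properties of integrals, we obtain that

$$\begin{aligned}
\int_{Y^{-1}(B)} XdP &= \sum_{i: a_i \in B} \int_{A_i} XdP \\
&= \sum_{i=1}^{\infty} \frac{\int_{A_i} XdP}{P(A_i)} P\left(A_i \cap Y^{-1}(B)\right) \\
&= \int_{Y^{-1}(B)} \left[ \sum_{i=1}^{\infty} \frac{\int_{A_i} XdP}{P(A_i)} I_{A_i} \right] dP.
\end{aligned}$$

This verifies (b) and thus (1.59) holds.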

Let $h$ be a Borel function on $\mathcal{R}$ satisfying $h(a_i) = \int_{A_i} XdP/P(A_i)$.
Then, by (1.59), $E(X|Y) = h \circ Y$ and $E(X|Y = y) = h(y)$.

Let $A \in \mathcal{F}$ and $X = I_A$. Then

$$P(A|Y) = E(X|Y) = \sum_{i=1}^{\infty} \frac{P(A \cap A_i)}{P(A_i)} I_{A_i},$$

which equals $P(A \cap A_i)/P(A_i) = P(A|A_i)$ if $\omega \in A_i$. Hence, the definition
of conditional probability in Definition 1.6 agrees with that in elementary
probability.
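Formula (1.59) can be made concrete on a finite sample space, where integrals reduce to weighted sums. The sketch below is illustrative only: the six-point sample space, its probabilities, the values of $X$, the partition, and the helper name `cond_exp_given_partition` are all assumptions. It computes $h(a_i) = \int_{A_i} XdP/P(A_i)$ with exact rational arithmetic and checks condition (b) of Definition 1.6 for one event of the form $Y^{-1}(B)$.

```python
from fractions import Fraction

# Hypothetical finite sample space {0, ..., 5} with point probabilities P[w].
P = [Fraction(1, 12), Fraction(1, 6), Fraction(1, 4),
     Fraction(1, 4), Fraction(1, 6), Fraction(1, 12)]
X = [3, -1, 4, 1, -5, 9]                           # values X(w)
partition = {10: [0, 1], 20: [2], 30: [3, 4, 5]}   # Y = a_i on A_i


def cond_exp_given_partition(X, P, partition):
    """Return h with h(a_i) = (integral of X over A_i) / P(A_i), as in (1.59)."""
    return {a: sum(X[w] * P[w] for w in A) / sum(P[w] for w in A)
            for a, A in partition.items()}


h = cond_exp_given_partition(X, P, partition)
Y = {w: a for a, A in partition.items() for w in A}
E_X_given_Y = [h[Y[w]] for w in range(len(P))]     # the random variable h(Y)

# Condition (b): integrals of E(X|Y) and X agree over Y^{-1}(B), here B = {10, 30}.
event = [w for w in range(len(P)) if Y[w] in {10, 30}]
lhs = sum(E_X_given_Y[w] * P[w] for w in event)
rhs = sum(X[w] * P[w] for w in event)
print(h, lhs == rhs)
```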


The next result generalizes the result in Example 1.21 to conditional 
expectations of random variables having p.d.f.’s. 


Proposition 1.9. Let $X$ be a random $n$-vector and $Y$ a random $m$-vector.
Suppose that $(X,Y)$ has a joint p.d.f. $f(x,y)$ w.r.t. $\nu \times \lambda$, where $\nu$ and $\lambda$
are $\sigma$-finite measures on $(\mathcal{R}^n, \mathcal{B}^n)$ and $(\mathcal{R}^m, \mathcal{B}^m)$, respectively. Let $g(x,y)$
be a Borel function on $\mathcal{R}^{n+m}$ for which $E|g(X,Y)| < \infty$. Then

$$E[g(X,Y)|Y] = \frac{\int g(x,Y)f(x,Y)d\nu(x)}{\int f(x,Y)d\nu(x)}. \qquad (1.60)$$




Proof. Denote the right-hand side of (1.60) by $h(Y)$. By Fubini’s theorem,
$h$ is Borel. Then, by Lemma 1.2, $h(Y)$ is Borel on $(\Omega, \sigma(Y))$. Also, by
Fubini’s theorem, $f_Y(y) = \int f(x,y)d\nu(x)$ is the p.d.f. of $Y$ w.r.t. $\lambda$. For
$B \in \mathcal{B}^m$,

$$\begin{aligned}
\int_{Y^{-1}(B)} h(Y)dP &= \int_B h(y)dP_Y \\
&= \int_B \frac{\int g(x,y)f(x,y)d\nu(x)}{\int f(x,y)d\nu(x)} f_Y(y)d\lambda(y) \\
&= \int_{\mathcal{R}^n \times B} g(x,y)f(x,y)d\nu \times \lambda \\
&= \int_{\mathcal{R}^n \times B} g(x,y)dP_{(X,Y)} \\
&= \int_{Y^{-1}(B)} g(X,Y)dP,
\end{aligned}$$

where the first and the last equalities follow from Theorem 1.2, the second
and the next to last equalities follow from the definition of $h$ and p.d.f.’s,
and the third equality follows from Theorem 1.3 (Fubini’s theorem). ∎


For a random vector $(X,Y)$ with a joint p.d.f. $f(x,y)$ w.r.t. $\nu \times \lambda$, define
the conditional p.d.f. of $X$ given $Y = y$ to be

$$f_{X|Y}(x|y) = \frac{f(x,y)}{f_Y(y)}, \qquad (1.61)$$

where $f_Y(y) = \int f(x,y)d\nu(x)$ is the marginal p.d.f. of $Y$ w.r.t. $\lambda$. One can
easily check that for each fixed $y$ with $f_Y(y) > 0$, $f_{X|Y}(x|y)$ in (1.61) is a
p.d.f. w.r.t. $\nu$. Then equation (1.60) can be rewritten as

$$E[g(X,Y)|Y] = \int g(x,Y)f_{X|Y}(x|Y)d\nu(x).$$


Again, this agrees with the conditional expectation defined in elementary 
probability (i.e., the conditional expectation of g(X,Y) given Y is equal to 
the expectation of g(X,Y) w.r.t. the conditional p.d.f. of X given Y). 


Now we list some useful properties of conditional expectations. The 
proof is left to the reader. 


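Proposition 1.10. Let $X, Y, X_1, X_2, ...$ be integrable random variables
on $(\Omega, \mathcal{F}, P)$ and $\mathcal{A}$ be a sub-$\sigma$-field of $\mathcal{F}$.

(i) If $X = c$ a.s., $c \in \mathcal{R}$, then $E(X|\mathcal{A}) = c$ a.s.

(ii) If $X \leq Y$ a.s., then $E(X|\mathcal{A}) \leq E(Y|\mathcal{A})$ a.s.

(iii) If $a \in \mathcal{R}$ and $b \in \mathcal{R}$, then $E(aX + bY|\mathcal{A}) = aE(X|\mathcal{A}) + bE(Y|\mathcal{A})$ a.s.

(iv) $E[E(X|\mathcal{A})] = EX$.

(v) $E[E(X|\mathcal{A})|\mathcal{A}_0] = E(X|\mathcal{A}_0) = E[E(X|\mathcal{A}_0)|\mathcal{A}]$ a.s., where $\mathcal{A}_0$ is a
sub-$\sigma$-field of $\mathcal{A}$.

(vi) If $\sigma(Y) \subset \mathcal{A}$ and $E|XY| < \infty$, then $E(XY|\mathcal{A}) = YE(X|\mathcal{A})$ a.s.

(vii) If $X$ and $Y$ are independent and $E|g(X,Y)| < \infty$ for a Borel function
$g$, then $E[g(X,Y)|Y = y] = E[g(X,y)]$ a.s. $P_Y$.

(viii) If $EX^2 < \infty$, then $[E(X|\mathcal{A})]^2 \leq E(X^2|\mathcal{A})$ a.s.

(ix) (Fatou’s lemma). If $X_n \geq 0$ for any $n$, then
$E(\liminf_n X_n|\mathcal{A}) \leq \liminf_n E(X_n|\mathcal{A})$ a.s.

(x) (Dominated convergence theorem). Suppose that $|X_n| \leq Y$ for any $n$
and $X_n \to_{a.s.} X$. Then $E(X_n|\mathcal{A}) \to_{a.s.} E(X|\mathcal{A})$. ∎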


Although part (vii) of Proposition 1.10 can be proved directly, it is a 
consequence of a more general result given in Theorem 1.7(i). Since $E(X|\mathcal{A})$
is defined only for integrable $X$, a version of the monotone convergence theorem
(i.e., $0 \leq X_1 \leq X_2 \leq \cdots$ and $X_n \to_{a.s.} X$ imply $E(X_n|\mathcal{A}) \to_{a.s.} E(X|\mathcal{A})$)
becomes a special case of Proposition 1.10(x).


It can also be shown (exercise) that Hölder’s inequality (1.40), Liapounov’s
inequality (1.42), Minkowski’s inequality (1.43), and Jensen’s inequality
(1.47) hold a.s. with the expectation $E$ replaced by the conditional
expectation $E(\cdot|\mathcal{A})$.


As an application, we consider the following example. 


Example 1.22. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ with $EX^2 < \infty$
and let $Y$ be a measurable function from $(\Omega, \mathcal{F}, P)$ to $(\Lambda, \mathcal{G})$. One may wish
to predict the value of $X$ based on an observed value of $Y$. Let $g(Y)$ be a
predictor, i.e., $g \in \aleph = \{$all Borel functions $g$ with $E[g(Y)]^2 < \infty\}$. Each
predictor is assessed by the “mean squared prediction error” $E[X - g(Y)]^2$.
We now show that $E(X|Y)$ is the best predictor of $X$ in the sense that

$$E[X - E(X|Y)]^2 = \min_{g \in \aleph} E[X - g(Y)]^2. \qquad (1.62)$$

First, Proposition 1.10(viii) implies $E(X|Y) \in \aleph$. Next, for any $g \in \aleph$,

$$\begin{aligned}
E[X - g(Y)]^2 &= E[X - E(X|Y) + E(X|Y) - g(Y)]^2 \\
&= E[X - E(X|Y)]^2 + E[E(X|Y) - g(Y)]^2 \\
&\quad + 2E\{[X - E(X|Y)][E(X|Y) - g(Y)]\} \\
&= E[X - E(X|Y)]^2 + E[E(X|Y) - g(Y)]^2 \\
&\quad + 2E\{E\{[X - E(X|Y)][E(X|Y) - g(Y)]|Y\}\} \\
&= E[X - E(X|Y)]^2 + E[E(X|Y) - g(Y)]^2 \\
&\quad + 2E\{[E(X|Y) - g(Y)]E[X - E(X|Y)|Y]\} \\
&= E[X - E(X|Y)]^2 + E[E(X|Y) - g(Y)]^2 \\
&\geq E[X - E(X|Y)]^2,
\end{aligned}$$

where the third equality follows from Proposition 1.10(iv), the fourth
equality follows from Proposition 1.10(vi), and the last equality follows from
Proposition 1.10(i), (iii), and (vi). ∎
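The decomposition used in Example 1.22 can be verified exactly for a small discrete joint distribution. The sketch below is illustrative: the joint p.m.f., the list of competing predictors, and the helper names `marginal_y`, `cond_mean`, and `mspe` are assumptions. For each competitor $g$ it checks that the excess mean squared prediction error over $E(X|Y)$ equals $E[E(X|Y) - g(Y)]^2$, which is nonnegative.

```python
from fractions import Fraction

# Hypothetical joint p.m.f.: pmf[(x, y)] = P(X = x, Y = y).
pmf = {(0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
       (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4)}


def marginal_y(pmf):
    """P(Y = y) for each y."""
    out = {}
    for (_, y), p in pmf.items():
        out[y] = out.get(y, 0) + p
    return out


def cond_mean(pmf):
    """E(X | Y = y) for each y with P(Y = y) > 0."""
    p_y = marginal_y(pmf)
    return {y: sum(x * p for (x, yy), p in pmf.items() if yy == y) / p_y[y]
            for y in p_y}


def mspe(pmf, g):
    """Mean squared prediction error E[X - g(Y)]^2 for a predictor given as a dict."""
    return sum(p * (x - g[y]) ** 2 for (x, y), p in pmf.items())


p_y = marginal_y(pmf)
best = cond_mean(pmf)
best_err = mspe(pmf, best)

competitors = [{0: a, 1: b}
               for a in (Fraction(0), Fraction(1, 2), Fraction(1))
               for b in (Fraction(0), Fraction(1), Fraction(2))]
gaps = [mspe(pmf, g) - best_err for g in competitors]
expected_gaps = [sum(p_y[y] * (best[y] - g[y]) ** 2 for y in p_y) for g in competitors]
print(best, best_err, min(gaps))
```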


1.4.2 Independence 


Definition 1.7. Let $(\Omega, \mathcal{F}, P)$ be a probability space.
(i) Let $\mathcal{C}$ be a collection of subsets in $\mathcal{F}$. Events in $\mathcal{C}$ are said to be
independent if and only if for any positive integer $n$ and distinct events
$A_1,...,A_n$ in $\mathcal{C}$,

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2) \cdots P(A_n).$$

(ii) Collections $\mathcal{C}_i \subset \mathcal{F}$, $i \in \mathcal{I}$ (an index set that can be uncountable), are
said to be independent if and only if events in any collection of the form
$\{A_i \in \mathcal{C}_i : i \in \mathcal{I}\}$ are independent.
(iii) Random elements $X_i$, $i \in \mathcal{I}$, are said to be independent if and only if
$\sigma(X_i)$, $i \in \mathcal{I}$, are independent. ∎


The following result is useful for checking the independence of $\sigma$-fields.


Lemma 1.3. Let $\mathcal{C}_i$, $i \in \mathcal{I}$, be independent collections of events. Suppose
that each $\mathcal{C}_i$ has the property that if $A \in \mathcal{C}_i$ and $B \in \mathcal{C}_i$, then $A \cap B \in \mathcal{C}_i$.
Then $\sigma(\mathcal{C}_i)$, $i \in \mathcal{I}$, are independent. ∎


An immediate application of Lemma 1.3 is to show (exercise) that random
variables $X_i$, $i = 1,...,k$, are independent according to Definition 1.7
if and only if (1.7) holds with $F$ being the joint c.d.f. of $(X_1,...,X_k)$ and $F_i$
being the marginal c.d.f. of $X_i$. Hence, Definition 1.7(iii) agrees with the
concept of independence of random variables discussed in §1.3.1.


It is easy to see from Definition 1.7 that if X and Y are independent 
random vectors, then so are g(X) and h(Y) for Borel functions g and h. 
Since the independence in Definition 1.7 is equivalent to the independence 
discussed in §1.3.1, this provides a simple proof of Lemma 1.1. 

For two events A and B with P(A) > 0, A and B are independent if 
and only if P(B|A) = P(B). This means that A provides no information 
about the probability of the occurrence of B. The following result is a 
useful extension. 


Proposition 1.11. Let $X$ be a random variable with $E|X| < \infty$ and
let $Y_i$ be random $k_i$-vectors, $i = 1,2$. Suppose that $(X,Y_1)$ and $Y_2$ are
independent. Then

$$E[X|(Y_1,Y_2)] = E(X|Y_1) \quad \text{a.s.}$$




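Proof. First, $E(X|Y_1)$ is Borel on $(\Omega, \sigma(Y_1,Y_2))$, since $\sigma(Y_1) \subset \sigma(Y_1,Y_2)$.
Next, we need to show that for any Borel set $B \in \mathcal{B}^{k_1+k_2}$,

$$\int_{(Y_1,Y_2)^{-1}(B)} XdP = \int_{(Y_1,Y_2)^{-1}(B)} E(X|Y_1)dP. \qquad (1.63)$$

If $B = B_1 \times B_2$, where $B_i \in \mathcal{B}^{k_i}$, then $(Y_1,Y_2)^{-1}(B) = Y_1^{-1}(B_1) \cap Y_2^{-1}(B_2)$
and

$$\begin{aligned}
\int_{Y_1^{-1}(B_1) \cap Y_2^{-1}(B_2)} E(X|Y_1)dP &= \int I_{Y_1^{-1}(B_1)} I_{Y_2^{-1}(B_2)} E(X|Y_1)dP \\
&= \int I_{Y_1^{-1}(B_1)} E(X|Y_1)dP \int I_{Y_2^{-1}(B_2)}dP \\
&= \int I_{Y_1^{-1}(B_1)} XdP \int I_{Y_2^{-1}(B_2)}dP \\
&= \int I_{Y_1^{-1}(B_1)} I_{Y_2^{-1}(B_2)} XdP \\
&= \int_{Y_1^{-1}(B_1) \cap Y_2^{-1}(B_2)} XdP,
\end{aligned}$$

where the second and the next to last equalities follow from result (1.38)
and the independence of $(X,Y_1)$ and $Y_2$, and the third equality follows from
the fact that $E(X|Y_1)$ is the conditional expectation of $X$ given $Y_1$. This
shows that (1.63) holds for $B = B_1 \times B_2$. We can show that the collection
$\mathcal{H} = \{B \subset \mathcal{R}^{k_1+k_2} : B$ satisfies (1.63)$\}$ is a $\sigma$-field. Since we have already
shown that $\mathcal{B}^{k_1} \times \mathcal{B}^{k_2} \subset \mathcal{H}$, $\mathcal{B}^{k_1+k_2} = \sigma(\mathcal{B}^{k_1} \times \mathcal{B}^{k_2}) \subset \mathcal{H}$ and thus the result
follows. ∎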


Clearly, the result in Proposition 1.11 still holds if $X$ is replaced by
$h(X)$ for any Borel $h$ and, hence,

$$P(A|Y_1,Y_2) = P(A|Y_1) \ \text{a.s. for any } A \in \sigma(X), \qquad (1.64)$$

if $(X,Y_1)$ and $Y_2$ are independent. If $Y_1$ is a constant and $Y = Y_2$, (1.64)
reduces to $P(A|Y) = P(A)$ a.s. for any $A \in \sigma(X)$, if $X$ and $Y$ are
independent, i.e., $\sigma(Y)$ does not provide any additional information about the
stochastic behavior of $X$. This actually provides another equivalent but
more intuitive definition of the independence of $X$ and $Y$ (or two $\sigma$-fields).


With a nonconstant $Y_1$, we say that given $Y_1$, $X$ and $Y_2$ are conditionally
independent if and only if (1.64) holds. Then the result in Proposition 1.11
can be stated as: if $Y_2$ and $(X,Y_1)$ are independent, then given $Y_1$, $X$ and
$Y_2$ are conditionally independent. It is important to know that the result in
Proposition 1.11 may not be true if $Y_2$ is independent of $X$ but not $(X,Y_1)$
(Exercise 96).




1.4.3 Conditional distributions 


The conditional p.d.f. was introduced in §1.4.1 for random variables having 
p.d.f.’s w.r.t. some measures. We now consider conditional distributions in 
general cases where we may not have any p.d.f. 


Let $X$ and $Y$ be two random vectors defined on a common probability
space. It is reasonable to consider $P[X^{-1}(B)|Y = y]$ as a candidate for
the conditional distribution of $X$, given $Y = y$, where $B$ is any Borel set.
However, since conditional probability is defined almost surely, for any fixed
$y$, $P[X^{-1}(B)|Y = y]$ may not be a probability measure. The first part of
the following theorem (whose proof can be found in Billingsley (1986, pp.
460-461)) shows that there exists a version of conditional probability such
that $P[X^{-1}(B)|Y = y]$ is a probability measure for any fixed $y$.


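Theorem 1.7. (i) (Existence of conditional distributions). Let $X$ be a
random $n$-vector on a probability space $(\Omega, \mathcal{F}, P)$ and $\mathcal{A}$ be a sub-$\sigma$-field of
$\mathcal{F}$. Then there exists a function $P(B,\omega)$ on $\mathcal{B}^n \times \Omega$ such that

(a) $P(B,\omega) = P[X^{-1}(B)|\mathcal{A}]$ a.s. for any fixed $B \in \mathcal{B}^n$, and

(b) $P(\cdot,\omega)$ is a probability measure on $(\mathcal{R}^n, \mathcal{B}^n)$ for any fixed $\omega \in \Omega$.
Let $Y$ be measurable from $(\Omega, \mathcal{F}, P)$ to $(\Lambda, \mathcal{G})$. Then there exists $P_{X|Y}(B|y)$
such that

(a) $P_{X|Y}(B|y) = P[X^{-1}(B)|Y = y]$ a.s. $P_Y$ for any fixed $B \in \mathcal{B}^n$, and

(b) $P_{X|Y}(\cdot|y)$ is a probability measure on $(\mathcal{R}^n, \mathcal{B}^n)$ for any fixed $y \in \Lambda$.
Furthermore, if $E|g(X,Y)| < \infty$ with a Borel function $g$, then

$$E[g(X,Y)|Y = y] = E[g(X,y)|Y = y] = \int_{\mathcal{R}^n} g(x,y)dP_{X|Y}(x|y) \quad \text{a.s. } P_Y.$$

(ii) Let $(\Lambda, \mathcal{G}, P_1)$ be a probability space. Suppose that $P_2$ is a function
from $\mathcal{B}^n \times \Lambda$ to $\mathcal{R}$ and satisfies
(a) $P_2(\cdot,y)$ is a probability measure on $(\mathcal{R}^n, \mathcal{B}^n)$ for any $y \in \Lambda$, and
(b) $P_2(B,\cdot)$ is Borel for any $B \in \mathcal{B}^n$.
Then there is a unique probability measure $P$ on $(\mathcal{R}^n \times \Lambda, \sigma(\mathcal{B}^n \times \mathcal{G}))$ such
that, for $B \in \mathcal{B}^n$ and $C \in \mathcal{G}$,

$$P(B \times C) = \int_C P_2(B,y)dP_1(y). \qquad (1.65)$$

Furthermore, if $(\Lambda, \mathcal{G}) = (\mathcal{R}^m, \mathcal{B}^m)$, and $X(x,y) = x$ and $Y(x,y) = y$ define
the coordinate random vectors, then $P_Y = P_1$, $P_{X|Y}(\cdot|y) = P_2(\cdot,y)$, and
the probability measure in (1.65) is the joint distribution of $(X,Y)$, which
has the following joint c.d.f.:

$$F(x,y) = \int_{(-\infty,y]} P_{X|Y}((-\infty,x]|z)dP_Y(z), \quad x \in \mathcal{R}^n, \ y \in \mathcal{R}^m, \qquad (1.66)$$

where $(-\infty,a]$ denotes $(-\infty,a_1] \times \cdots \times (-\infty,a_k]$ for $a = (a_1,...,a_k)$. ∎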




For a fixed $y$, $P_{X|Y=y} = P_{X|Y}(\cdot|y)$ is called the conditional distribution
of $X$ given $Y = y$. Under the conditions in Theorem 1.7(i), if $Y$ is a random
$m$-vector and $(X,Y)$ has a p.d.f. w.r.t. $\nu \times \lambda$ ($\nu$ and $\lambda$ are $\sigma$-finite measures
on $(\mathcal{R}^n, \mathcal{B}^n)$ and $(\mathcal{R}^m, \mathcal{B}^m)$, respectively), then $f_{X|Y}(x|y)$ defined in (1.61)
is the p.d.f. of $P_{X|Y=y}$ w.r.t. $\nu$ for any fixed $y$.

The second part of Theorem 1.7 states that given a distribution on one
space and a collection of conditional distributions (which are conditioned
on values of the first space) on another space, we can construct a joint
distribution in the product space. It is sometimes called the “two-stage
experiment theorem” for the following reason. If $Y \in \mathcal{R}^m$ is selected in
stage 1 of an experiment according to its marginal distribution $P_Y = P_1$,
and $X$ is chosen afterward according to a distribution $P_2(\cdot,y)$, then the
combined two-stage experiment produces a jointly distributed pair $(X,Y)$
with distribution $P_{(X,Y)}$ given by (1.65) and $P_{X|Y=y} = P_2(\cdot,y)$. This provides
a way of generating dependent random variables. The following is an
example.


Example 1.23. A market survey is conducted to study whether a new 
product is preferred over the product currently available in the market (old 
product). The survey is conducted by mail. Questionnaires are sent along 
with the sample products (both new and old) to N customers randomly 
selected from a population, where N is a positive integer. Each customer is 
asked to fill out the questionnaire and return it. Responses from customers 
are either 1 (new is better than old) or 0 (otherwise). Some customers, 
however, do not return the questionnaires. Let X be the number of ones in 
the returned questionnaires. What is the distribution of X? 


If every customer returns the questionnaire, then (from elementary
probability) $X$ has the binomial distribution $Bi(p,N)$ in Table 1.1 (assuming
that the population is large enough so that customers respond independently),
where $p \in (0,1)$ is the overall rate of customers who prefer
the new product. Now, let $Y$ be the number of customers who respond.
Then $Y$ is random. Suppose that customers respond independently with
the same probability $\pi \in (0,1)$. Then $P_Y$ is the binomial distribution
$Bi(\pi,N)$. Given $Y = y$ (an integer between 0 and $N$), $P_{X|Y=y}$ is the binomial
distribution $Bi(p,y)$ if $y \geq 1$ and the point mass at 0 (see (1.22)) if
$y = 0$. Using (1.66) and the fact that binomial distributions have p.d.f.’s
w.r.t. counting measure, we obtain that the joint c.d.f. of $(X,Y)$ is


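$$\begin{aligned}
F(x,y) &= \sum_{k=0}^{y} P_{X|Y=k}((-\infty,x]) \binom{N}{k} \pi^k (1-\pi)^{N-k} \\
&= \sum_{k=0}^{y} \sum_{j=0}^{\min\{x,k\}} \binom{k}{j} p^j (1-p)^{k-j} \binom{N}{k} \pi^k (1-\pi)^{N-k}
\end{aligned}$$

for $x = 0,1,...,y$, $y = 0,1,...,N$. The marginal c.d.f. $F_X(x) = F(x,\infty) =
F(x,N)$. The p.d.f. of $X$ w.r.t. counting measure is

$$\begin{aligned}
f_X(x) &= \sum_{k=x}^{N} \binom{k}{x} p^x (1-p)^{k-x} \binom{N}{k} \pi^k (1-\pi)^{N-k} \\
&= \binom{N}{x} (\pi p)^x (1-\pi p)^{N-x} \sum_{k=x}^{N} \binom{N-x}{k-x} \left( \frac{\pi - \pi p}{1-\pi p} \right)^{k-x} \left( \frac{1-\pi}{1-\pi p} \right)^{N-k} \\
&= \binom{N}{x} (\pi p)^x (1-\pi p)^{N-x}
\end{aligned}$$

for $x = 0,1,...,N$. It turns out that the marginal distribution of $X$ is the
binomial distribution $Bi(\pi p,N)$. ∎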
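The two-stage construction in Example 1.23 can be checked exactly for specific parameter values. The sketch below is illustrative only; the values of $N$, $p$, and $\pi$ and the helper names `binom_pmf` and `marginal_pmf_x` are assumptions. It sums $P(X = x|Y = k)P(Y = k)$ over $k$ with rational arithmetic and compares the result with the p.d.f. of $Bi(\pi p,N)$.

```python
from fractions import Fraction
from math import comb


def binom_pmf(k, n, q):
    """P(K = k) for K with success probability q and n trials (n = 0 gives a point mass at 0)."""
    return comb(n, k) * q**k * (1 - q)**(n - k)


def marginal_pmf_x(x, N, p, pi):
    """P(X = x) when Y has success probability pi with N trials and X | Y = k has probability p with k trials."""
    return sum(binom_pmf(x, k, p) * binom_pmf(k, N, pi) for k in range(x, N + 1))


# Hypothetical survey size, preference rate p, and response rate pi.
N, p, pi = 6, Fraction(2, 5), Fraction(3, 4)
two_stage = [marginal_pmf_x(x, N, p, pi) for x in range(N + 1)]
direct = [binom_pmf(x, N, pi * p) for x in range(N + 1)]
print(two_stage == direct)
```

Because the arithmetic is exact, equality of the two lists confirms the identity for these parameter values without rounding concerns.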


1.4.4 Markov chains and martingales 


As applications of conditional expectations, we introduce here two impor- 
tant types of dependent sequences of random variables. 


Markov chains 


A sequence of random vectors $\{X_n : n = 1,2,...\}$ is said to be a Markov
chain or Markov process if and only if

$$P(B|X_1,...,X_n) = P(B|X_n) \ \text{a.s.}, \quad B \in \sigma(X_{n+1}), \ n = 2,3,.... \qquad (1.67)$$

Comparing (1.67) with (1.64), we conclude that (1.67) implies that
$X_{n+1}$ (tomorrow) is conditionally independent of $(X_1,...,X_{n-1})$ (the past),
given $X_n$ (today). But $(X_1,...,X_{n-1})$ is not necessarily independent of
$(X_n,X_{n+1})$.

Clearly, a sequence of independent random vectors forms a Markov chain
since, by Proposition 1.11, both quantities on two sides of (1.67) are equal to
$P(B)$ for independent $X_i$’s. The following example describes some Markov
processes of dependent random variables.


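Example 1.24 (First-order autoregressive processes). Let $\varepsilon_1, \varepsilon_2, ...$ be
independent random variables defined on a probability space, $X_1 = \varepsilon_1$, and
$X_{n+1} = \rho X_n + \varepsilon_{n+1}$, $n = 1,2,...$, where $\rho$ is a constant in $\mathcal{R}$. Then $\{X_n\}$
is called a first-order autoregressive process. We now show that for any
$B \in \mathcal{B}$ and $n = 1,2,...$,

$$P(X_{n+1} \in B|X_1,...,X_n) = P_{\varepsilon_{n+1}}(B - \rho X_n) = P(X_{n+1} \in B|X_n) \quad \text{a.s.},$$

where $B - y = \{x \in \mathcal{R} : x + y \in B\}$, which implies that $\{X_n\}$ is a Markov
chain. For any $y \in \mathcal{R}$,

$$P_{\varepsilon_{n+1}}(B - y) = P(\varepsilon_{n+1} + y \in B) = \int I_B(x + y)dP_{\varepsilon_{n+1}}(x)$$

and, by Fubini’s theorem, $P_{\varepsilon_{n+1}}(B - y)$ is Borel. Hence, $P_{\varepsilon_{n+1}}(B - \rho X_n)$
is Borel w.r.t. $\sigma(X_n)$ and, thus, is Borel w.r.t. $\sigma(X_1,...,X_n)$. Let $B_j \in \mathcal{B}$,
$j = 1,...,n$, and $A = \cap_{j=1}^{n} X_j^{-1}(B_j)$. Since $\varepsilon_{n+1} + \rho X_n = X_{n+1}$ and $\varepsilon_{n+1}$
is independent of $(X_1,...,X_n)$, it follows from Theorem 1.2 and Fubini’s
theorem that

$$\begin{aligned}
\int_A P_{\varepsilon_{n+1}}(B - \rho X_n)dP &= \int_{x_j \in B_j, j=1,...,n} \int_{t \in B - \rho x_n} dP_{\varepsilon_{n+1}}(t)dP_X(x) \\
&= \int_{x_j \in B_j, j=1,...,n, \, x_{n+1} \in B} dP_{(X,\varepsilon_{n+1})}(x,t) \\
&= P\left(A \cap X_{n+1}^{-1}(B)\right),
\end{aligned}$$

where $X$ and $x$ denote $(X_1,...,X_n)$ and $(x_1,...,x_n)$, respectively, and $x_{n+1}$
denotes $\rho x_n + t$. Using this and the argument in the end of the proof for
Proposition 1.11, we obtain $P(X_{n+1} \in B|X_1,...,X_n) = P_{\varepsilon_{n+1}}(B - \rho X_n)$
a.s. The proof for $P_{\varepsilon_{n+1}}(B - \rho X_n) = P(X_{n+1} \in B|X_n)$ a.s. is similar and
simpler. ∎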
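The Markov property in Example 1.24 can be verified exactly in a small discrete case by enumerating all noise sequences. The sketch below is illustrative only: the noise distribution on $\{-1, 0, 1\}$, the choice $\rho = 1$, the horizon $n = 3$, and the helper names `ar1_path` and `conditional_law` are assumptions. It computes the conditional law of $X_{n+1}$ given $(X_1,...,X_n)$ and given $X_n$ alone, and compares both with the noise law shifted by $\rho X_n$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

rho = 1
# Hypothetical common distribution of the independent noise terms.
eps_probs = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}


def ar1_path(eps):
    """Return (X_1, ..., X_m) with X_1 = eps_1 and X_{j+1} = rho * X_j + eps_{j+1}."""
    xs = [eps[0]]
    for e in eps[1:]:
        xs.append(rho * xs[-1] + e)
    return tuple(xs)


def conditional_law(n, summary):
    """Exact law of X_{n+1} given summary(X_1, ..., X_n), by enumeration."""
    joint = defaultdict(lambda: defaultdict(Fraction))
    for eps in product(eps_probs, repeat=n + 1):
        prob = Fraction(1)
        for e in eps:
            prob *= eps_probs[e]
        xs = ar1_path(eps)
        joint[summary(xs[:n])][xs[n]] += prob
    laws = {}
    for key, row in joint.items():
        total = sum(row.values())
        laws[key] = {x: p / total for x, p in row.items()}
    return laws


n = 3
given_past = conditional_law(n, lambda past: past)          # condition on (X_1, ..., X_n)
given_present = conditional_law(n, lambda past: past[-1])   # condition on X_n only

markov = all(law == given_present[past[-1]] for past, law in given_past.items())
matches_shift = all(law == {rho * x_n + e: p for e, p in eps_probs.items()}
                    for x_n, law in given_present.items())
print(markov, matches_shift)
```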


The following result provides some characterizations of Markov chains. 


Proposition 1.12. A sequence of random vectors $\{X_n\}$ is a Markov chain
if and only if one of the following three conditions holds.

(a) For any $n = 2,3,...$ and any integrable $h(X_{n+1})$ with a Borel function
$h$, $E[h(X_{n+1})|X_1,...,X_n] = E[h(X_{n+1})|X_n]$ a.s.

(b) For any $n = 1,2,...$ and $B \in \sigma(X_{n+1},X_{n+2},...)$, $P(B|X_1,...,X_n) =
P(B|X_n)$ a.s.

(c) For any $n = 2,3,...$, $A \in \sigma(X_1,...,X_n)$, and $B \in \sigma(X_{n+1},X_{n+2},...)$,
$P(A \cap B|X_n) = P(A|X_n)P(B|X_n)$ a.s.

Proof. (i) It is clear that (a) implies (1.67). If $h$ is a simple function,
then (1.67) and Proposition 1.10(iii) imply (a). If $h$ is nonnegative, then
by Exercise 17 there are nonnegative simple functions $h_1 \leq h_2 \leq \cdots \leq h$
such that $h_j \to h$. Then (1.67) together with Proposition 1.10(iii) and (x)
imply (a). Since $h = h_+ - h_-$, we conclude that (1.67) implies (a).

(ii) It is also clear that (b) implies (1.67). We now show that (1.67) implies
(b). Note that $\sigma(X_{n+1},X_{n+2},...) = \sigma\left(\cup_{j=1}^{\infty} \sigma(X_{n+1},...,X_{n+j})\right)$ (Exercise
19). Hence, it suffices to show that $P(B|X_1,...,X_n) = P(B|X_n)$ a.s. for
$B \in \sigma(X_{n+1},...,X_{n+j})$ for any $j = 1,2,...$. We use induction. The result
for $j = 1$ follows from (1.67). Suppose that the result holds for any
$B \in \sigma(X_{n+1},...,X_{n+j})$. To show the result for any $B \in \sigma(X_{n+1},...,X_{n+j+1})$,
it is enough (why?) to show that for any $B_1 \in \sigma(X_{n+j+1})$ and any
$B_2 \in \sigma(X_{n+1},...,X_{n+j})$, $P(B_1 \cap B_2|X_1,...,X_n) = P(B_1 \cap B_2|X_n)$ a.s. From the
proof in (i), the induction assumption implies

$$E[h(X_{n+1},...,X_{n+j})|X_1,...,X_n] = E[h(X_{n+1},...,X_{n+j})|X_n] \qquad (1.68)$$

for any Borel function $h$. The result follows from

$$\begin{aligned}
E(I_{B_1}I_{B_2}|X_1,...,X_n) &= E[E(I_{B_1}I_{B_2}|X_1,...,X_{n+j})|X_1,...,X_n] \\
&= E[I_{B_2}E(I_{B_1}|X_1,...,X_{n+j})|X_1,...,X_n] \\
&= E[I_{B_2}E(I_{B_1}|X_{n+j})|X_1,...,X_n] \\
&= E[I_{B_2}E(I_{B_1}|X_{n+j})|X_n] \\
&= E[I_{B_2}E(I_{B_1}|X_n,...,X_{n+j})|X_n] \\
&= E[E(I_{B_1}I_{B_2}|X_n,...,X_{n+j})|X_n] \\
&= E(I_{B_1}I_{B_2}|X_n) \quad \text{a.s.},
\end{aligned}$$

where the first and last equalities follow from Proposition 1.10(v), the second
and sixth equalities follow from Proposition 1.10(vi), the third and fifth
equalities follow from (1.67), and the fourth equality follows from (1.68).
(iii) Let $A \in \sigma(X_1,...,X_n)$ and $B \in \sigma(X_{n+1},X_{n+2},...)$. If (b) holds, then
$E(I_AI_B|X_n) = E[E(I_AI_B|X_1,...,X_n)|X_n] = E[I_AE(I_B|X_1,...,X_n)|X_n] =
E[I_AE(I_B|X_n)|X_n] = E(I_A|X_n)E(I_B|X_n)$, which is (c).

Assume that (c) holds. Let $A_1 \in \sigma(X_n)$, $A_2 \in \sigma(X_1,...,X_{n-1})$, and
$B \in \sigma(X_{n+1},X_{n+2},...)$. Then

$$\begin{aligned}
\int_{A_1 \cap A_2} E(I_B|X_n)dP &= \int_{A_1} I_{A_2}E(I_B|X_n)dP \\
&= \int_{A_1} E[I_{A_2}E(I_B|X_n)|X_n]dP \\
&= \int_{A_1} E(I_{A_2}|X_n)E(I_B|X_n)dP \\
&= \int_{A_1} E(I_{A_2}I_B|X_n)dP \\
&= P(A_1 \cap A_2 \cap B).
\end{aligned}$$

Since disjoint unions of events of the form $A_1 \cap A_2$ as specified above generate
$\sigma(X_1,...,X_n)$, this shows that $E(I_B|X_n) = E(I_B|X_1,...,X_n)$ a.s., which
is (b). ∎


Note that condition (b) in Proposition 1.12 can be stated as “the past 
and the future are conditionally independent given the present”, which is a 
property of any Markov chain. More discussions and applications of Markov 
chains can be found in §4.1.4. 




Martingales 


Let $\{X_n\}$ be a sequence of integrable random variables on a probability
space $(\Omega, \mathcal{F}, P)$ and $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}$ be a sequence of $\sigma$-fields such that
$\sigma(X_n) \subset \mathcal{F}_n$, $n = 1,2,...$. The sequence $\{X_n, \mathcal{F}_n : n = 1,2,...\}$ is said to be
a martingale if and only if

$$E(X_{n+1}|\mathcal{F}_n) = X_n \ \text{a.s.}, \quad n = 1,2,..., \qquad (1.69)$$

a submartingale if and only if (1.69) holds with $=$ replaced by $\geq$, and a
supermartingale if and only if (1.69) holds with $=$ replaced by $\leq$. $\{X_n\}$
is said to be a martingale (submartingale or supermartingale) if and only
if $\{X_n, \sigma(X_1,...,X_n)\}$ is a martingale (submartingale or supermartingale).
From Proposition 1.10(v), if $\{X_n, \mathcal{F}_n\}$ is a martingale (submartingale or
supermartingale), then so is $\{X_n\}$.


A simple property of a martingale (or a submartingale) $\{X_n, \mathcal{F}_n\}$ is that
$E(X_{n+j}|\mathcal{F}_n) = X_n$ a.s. (or $E(X_{n+j}|\mathcal{F}_n) \geq X_n$ a.s.) and $EX_1 = EX_j$ (or
$EX_1 \leq EX_2 \leq \cdots$) for any $j = 1,2,...$ (exercise).

For any probability space $(\Omega, \mathcal{F}, P)$ and $\sigma$-fields $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}$,
we can always construct a martingale $\{E(Y|\mathcal{F}_n)\}$ by using an integrable
random variable $Y$. Another way to construct a martingale is to use a
sequence of independent integrable random variables $\{\varepsilon_n\}$ by letting
$X_n = \varepsilon_1 + \cdots + \varepsilon_n$, $n = 1,2,...$. Since

$$E(X_{n+1}|X_1,...,X_n) = E(X_n + \varepsilon_{n+1}|X_1,...,X_n) = X_n + E\varepsilon_{n+1} \quad \text{a.s.},$$

$\{X_n\}$ is a martingale if $E\varepsilon_n = 0$ for all $n$, a submartingale if $E\varepsilon_n \geq 0$ for
all $n$, and a supermartingale if $E\varepsilon_n \leq 0$ for all $n$. Note that in Example
1.24 with $\rho = 1$, $\{X_n\}$ is shown to be a Markov chain.


The next example provides another example of martingales. 


Example 1.25 (Likelihood ratio). Let $(\Omega, \mathcal{F}, P)$ be a probability space,
$Q$ be a probability measure on $\mathcal{F}$, and $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}$ be a sequence
of $\sigma$-fields. Let $P_n$ and $Q_n$ be $P$ and $Q$ restricted to $\mathcal{F}_n$, respectively,
$n = 1,2,...$. Suppose that $Q_n \ll P_n$ for each $n$. Then $\{X_n, \mathcal{F}_n\}$ is a
martingale, where $X_n = dQ_n/dP_n$ (the Radon-Nikodym derivative of $Q_n$
w.r.t. $P_n$), $n = 1,2,...$ (exercise). Suppose now that $\{Y_n\}$ is a sequence of
random variables on $(\Omega, \mathcal{F}, P)$, $\mathcal{F}_n = \sigma(Y_1,...,Y_n)$, and that there exists a
$\sigma$-finite measure $\nu_n$ on $\mathcal{F}_n$ such that $P_n \ll \nu_n$ and $\nu_n \ll P_n$, $n = 1,2,...$. Let
$p_n(Y_1,...,Y_n) = dP_n/d\nu_n$ and $q_n(Y_1,...,Y_n) = dQ_n/d\nu_n$. By Proposition
1.7(iii), $X_n = q_n(Y_1,...,Y_n)/p_n(Y_1,...,Y_n)$, which is called a likelihood ratio
in statistical terms. ∎


The following results contain some useful properties of martingales and 
submartingales. 




Proposition 1.13. Let $\varphi$ be a convex function on $\mathcal{R}$.

(i) If $\{X_n, \mathcal{F}_n\}$ is a martingale and $\varphi(X_n)$ is integrable for all $n$, then
$\{\varphi(X_n), \mathcal{F}_n\}$ is a submartingale.

(ii) If $\{X_n, \mathcal{F}_n\}$ is a submartingale, $\varphi(X_n)$ is integrable for all $n$, and $\varphi$ is
nondecreasing, then $\{\varphi(X_n), \mathcal{F}_n\}$ is a submartingale.

Proof. (i) Note that $\varphi(X_n) = \varphi(E(X_{n+1}|\mathcal{F}_n)) \leq E[\varphi(X_{n+1})|\mathcal{F}_n]$ a.s. by
Jensen’s inequality for conditional expectations (Exercise 89(c)).

(ii) Since $\varphi$ is nondecreasing and $\{X_n, \mathcal{F}_n\}$ is a submartingale, $\varphi(X_n) \leq
\varphi(E(X_{n+1}|\mathcal{F}_n)) \leq E[\varphi(X_{n+1})|\mathcal{F}_n]$ a.s. ∎


An application of Proposition 1.13 shows that if $\{X_n, \mathcal{F}_n\}$ is a submartingale,
then so is $\{(X_n)_+, \mathcal{F}_n\}$; if $\{X_n, \mathcal{F}_n\}$ is a martingale, then
$\{|X_n|, \mathcal{F}_n\}$ is a submartingale and so are $\{|X_n|^p, \mathcal{F}_n\}$, where $p > 1$ is a
constant, and $\{|X_n|(\log|X_n|)_+, \mathcal{F}_n\}$, provided that $|X_n|^p$ and $|X_n|(\log|X_n|)_+$
are integrable for all $n$.


Proposition 1.14 (Doob’s decomposition). Let $\{X_n, \mathcal{F}_n\}$ be a submartingale.
Then $X_n = Y_n + Z_n$, $n = 1,2,...$, where $\{Y_n, \mathcal{F}_n\}$ is a martingale,
$0 = Z_1 \leq Z_2 \leq \cdots$, and $EZ_n < \infty$ for all $n$. Furthermore, if
$\sup_n E|X_n| < \infty$, then $\sup_n E|Y_n| < \infty$ and $\sup_n EZ_n < \infty$.

Proof. Define $\eta_1 = X_1$, $\zeta_1 = 0$, $\eta_n = X_n - X_{n-1} - E(X_n - X_{n-1}|\mathcal{F}_{n-1})$, and
$\zeta_n = E(X_n - X_{n-1}|\mathcal{F}_{n-1})$ for $n \geq 2$. Then $Y_n = \sum_{i=1}^{n} \eta_i$ and $Z_n = \sum_{i=1}^{n} \zeta_i$
satisfy $X_n = Y_n + Z_n$ and the required conditions (exercise).


Assume now that $\sup_n E|X_n| < \infty$. Since $EY_1 = EY_n$ for any $n$ and
$Z_n \leq |X_n| - Y_n$, $EZ_n \leq E|X_n| - EY_1$. Hence $\sup_n EZ_n < \infty$. Also,
$|Y_n| \leq |X_n| + Z_n$. Hence $\sup_n E|Y_n| < \infty$. ∎
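Doob's decomposition can be computed explicitly for a small example by enumerating paths. The sketch below is illustrative only: it assumes the submartingale $X_n = S_n^2$, where $S_n$ is a simple symmetric random walk with steps $\pm 1$, takes $\mathcal{F}_n$ to be generated by the first $n$ steps, and uses hypothetical helper names (`x_value`, `increment_cond_mean`, `doob`, `is_martingale`). It builds $Y_n$ and $Z_n$ from the increments $\eta_n$ and $\zeta_n$ defined in the proof and checks the required properties exactly.

```python
from fractions import Fraction
from itertools import product


def x_value(path):
    """X_n = S_n^2 for the walk whose first n steps are `path`."""
    return sum(path) ** 2


def increment_cond_mean(prefix):
    """zeta_n = E(X_n - X_{n-1} | F_{n-1}) on the event that the first n-1 steps equal prefix."""
    return sum(Fraction(1, 2) * (x_value(prefix + (e,)) - x_value(prefix)) for e in (-1, 1))


def doob(path):
    """Return ([Y_1..Y_n], [Z_1..Z_n]) along one path, using eta_n and zeta_n from the proof."""
    ys, zs = [Fraction(x_value(path[:1]))], [Fraction(0)]
    for n in range(2, len(path) + 1):
        zeta = increment_cond_mean(path[:n - 1])
        eta = x_value(path[:n]) - x_value(path[:n - 1]) - zeta
        ys.append(ys[-1] + eta)
        zs.append(zs[-1] + zeta)
    return ys, zs


def is_martingale(n_steps):
    """Check E(Y_{n+1} | F_n) = Y_n by averaging over the two equally likely next steps."""
    for n in range(1, n_steps):
        for prefix in product((-1, 1), repeat=n):
            y_n = doob(prefix)[0][-1]
            avg_next = sum(Fraction(1, 2) * doob(prefix + (e,))[0][-1] for e in (-1, 1))
            if avg_next != y_n:
                return False
    return True


n_steps = 4
decomp = {path: doob(path) for path in product((-1, 1), repeat=n_steps)}
sums_match = all(y + z == x_value(path[:i + 1])
                 for path, (ys, zs) in decomp.items()
                 for i, (y, z) in enumerate(zip(ys, zs)))
z_nondecreasing = all(zs == sorted(zs) and zs[0] == 0 for _, zs in decomp.values())
print(sums_match, z_nondecreasing, is_martingale(n_steps))
```

In this particular example $Z_n = n - 1$ is deterministic, so $Y_n = S_n^2 - (n - 1)$.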


The following martingale convergence theorem, due to Doob, has many 
applications (see, e.g., Example 1.27 in §1.5.1). Its proof can be found, for 
example, in Billingsley (1986, pp. 490-491). 


Proposition 1.15. Let $\{X_n, \mathcal{F}_n\}$ be a submartingale. If $c = \sup_n E|X_n| < \infty$,
then $\lim_{n \to \infty} X_n = X$ a.s., where $X$ is a random variable satisfying
$E|X| \leq c$. ∎


1.5 Asymptotic Theory 


Asymptotic theory studies limiting behavior of random variables (vectors) 
and their distributions. It is an important tool for statistical analysis. A 
more complete coverage of asymptotic theory in statistical analysis can 
be found in Serfling (1980), Shorack and Wellner (1986), Sen and Singer 
(1993), Barndorff-Nielsen and Cox (1994), and van der Vaart (1998). 




1.5.1 Convergence modes and stochastic orders 


There are several convergence modes for random variables/vectors. Let
$r > 0$ be a constant. For any $c = (c_1,...,c_k) \in \mathcal{R}^k$, we define
$\|c\|_r = (\sum_{j=1}^{k} |c_j|^r)^{1/r}$. If $r \geq 1$, then $\|c\|_r$ is the $L_r$-distance between 0 and $c$.
When $r = 2$, the subscript $r$ is omitted and $\|c\| = \|c\|_2 = \sqrt{c^\tau c}$.


Definition 1.8. Let $X, X_1, X_2, ...$ be random $k$-vectors defined on a
probability space.

(i) We say that the sequence $\{X_n\}$ converges to $X$ almost surely (a.s.) and
write $X_n \to_{a.s.} X$ if and only if $\lim_{n \to \infty} X_n = X$ a.s.

(ii) We say that $\{X_n\}$ converges to $X$ in probability and write $X_n \to_p X$
if and only if, for every fixed $\epsilon > 0$,

$$\lim_{n \to \infty} P(\|X_n - X\| > \epsilon) = 0. \qquad (1.70)$$

(iii) We say that $\{X_n\}$ converges to $X$ in $L_r$ (or in $r$th moment) and write
$X_n \to_{L_r} X$ if and only if

$$\lim_{n \to \infty} E\|X_n - X\|_r^r = 0,$$

where $r > 0$ is a fixed constant.

(iv) Let $F, F_n$, $n = 1,2,...$, be c.d.f.’s on $\mathcal{R}^k$ and $P, P_n$, $n = 1,...$, be their
corresponding probability measures. We say that $\{F_n\}$ converges to $F$
weakly (or $\{P_n\}$ converges to $P$ weakly) and write $F_n \to_w F$ (or $P_n \to_w P$)
if and only if, for each continuity point $x$ of $F$,

$$\lim_{n \to \infty} F_n(x) = F(x).$$

We say that $\{X_n\}$ converges to $X$ in distribution (or in law) and write
$X_n \to_d X$ if and only if $F_{X_n} \to_w F_X$. ∎


The a.s. convergence has already been considered in previous sections. 
The concept of convergence in probability, convergence in L,, or a.s. con- 
vergence represents a sense in which, for n sufficiently large, X, and X 
approximate each other as functions on the original probability space. The 
concept of convergence in distribution in Definition 1.8(iv), however, depends
only on the distributions $F_{X_n}$ and $F_X$ (or probability measures $P_{X_n}$
and $P_X$) and does not necessitate that $X_n$ and $X$ are close in any sense; in
fact, Definition 1.8(iv) still makes sense even if $X$ and $X_n$’s are not defined
on the same probability space. In Definition 1.8(iv), it is not required that
$\lim_{n \to \infty} F_n(x) = F(x)$ for every $x$. However, if $F$ is a continuous function,
then we have the following stronger result.




Proposition 1.16 (Pólya’s theorem). If $F_n \to_w F$ and $F$ is continuous on
$\mathcal{R}^k$, then

$$\lim_{n \to \infty} \sup_{x \in \mathcal{R}^k} |F_n(x) - F(x)| = 0. \ ∎$$
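A numerical illustration of uniform convergence of c.d.f.'s to a continuous limit: the sketch below is an assumption-laden example, not part of the text. It takes $F_n$ to be the c.d.f. of an exponential distribution with scale $1 + n^{-1}$ and $F$ the c.d.f. of the exponential distribution with scale 1 (a hypothetical choice made for the example), and approximates $\sup_x |F_n(x) - F(x)|$ on a finite grid. Both c.d.f.'s vanish for $x \leq 0$, so only $x \geq 0$ is scanned. The helper name `sup_distance`, the grid size, and the truncation point are assumptions.

```python
import math


def sup_distance(n, grid_size=20_000, x_max=40.0):
    """Approximate sup_x |F_n(x) - F(x)| on an evenly spaced grid over [0, x_max]."""
    theta = 1 + 1 / n
    worst = 0.0
    for i in range(grid_size + 1):
        x = x_max * i / grid_size
        f_n = 1 - math.exp(-x / theta)   # c.d.f. of the exponential with scale theta
        f = 1 - math.exp(-x)             # limiting c.d.f., exponential with scale 1
        worst = max(worst, abs(f_n - f))
    return worst


distances = [sup_distance(n) for n in (1, 10, 100, 1000)]
print(distances)  # decreasing toward 0
```

A grid search only approximates the supremum; here the c.d.f.'s are smooth, so the grid error is small relative to the distances reported.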


A useful characterization of a.s. convergence is given in the following 
lemma. 


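Lemma 1.4. For random $k$-vectors $X, X_1, X_2, ...$ on a probability space,
$X_n \to_{a.s.} X$ if and only if for every $\epsilon > 0$,

$$\lim_{n \to \infty} P\left( \bigcup_{m=n}^{\infty} \{\|X_m - X\| > \epsilon\} \right) = 0. \qquad (1.71)$$

Proof. Let $A_j = \cup_{n=1}^{\infty} \cap_{m=n}^{\infty} \{\|X_m - X\| \leq j^{-1}\}$, $j = 1,2,...$. By Proposition
1.1(iii) and DeMorgan’s law, (1.71) holds for every $\epsilon > 0$ if and only
if $P(A_j) = 1$ for every $j$, which is equivalent to $P(\cap_{j=1}^{\infty} A_j) = 1$. The result
follows from $\cap_{j=1}^{\infty} A_j = \{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}$ (exercise). ∎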


The following result describes the relationship among the four conver- 
gence modes in Definition 1.8. 


Theorem 1.8. Let $X, X_1, X_2, ...$ be random $k$-vectors.

(i) If $X_n \to_{a.s.} X$, then $X_n \to_p X$.

(ii) If $X_n \to_{L_r} X$ for an $r > 0$, then $X_n \to_p X$.

(iii) If $X_n \to_p X$, then $X_n \to_d X$.

(iv) (Skorohod’s theorem). If $X_n \to_d X$, then there are random vectors
$Y, Y_1, Y_2, ...$ defined on a common probability space such that $P_Y = P_X$,
$P_{Y_n} = P_{X_n}$, $n = 1,2,...$, and $Y_n \to_{a.s.} Y$.

(v) If, for every $\epsilon > 0$, $\sum_{n=1}^{\infty} P(\|X_n - X\| > \epsilon) < \infty$, then $X_n \to_{a.s.} X$.

(vi) If $X_n \to_p X$, then there is a subsequence $\{X_{n_j}, j = 1,2,...\}$ such that
$X_{n_j} \to_{a.s.} X$ as $j \to \infty$.

(vii) If $X_n \to_d X$ and $P(X = c) = 1$, where $c \in \mathcal{R}^k$ is a constant vector,
then $X_n \to_p c$.

(viii) Suppose that $X_n \to_d X$. Then, for any $r > 0$,

$$\lim_{n \to \infty} E\|X_n\|_r^r = E\|X\|_r^r < \infty \qquad (1.72)$$

if and only if $\{\|X_n\|_r^r\}$ is uniformly integrable in the sense that

$$\lim_{t \to \infty} \sup_n E\left( \|X_n\|_r^r I_{\{\|X_n\|_r > t\}} \right) = 0. \ ∎ \qquad (1.73)$$


The proof of Theorem 1.8 is given after the following discussion and 
example. 




The converse of Theorem 1.8(i), (ii), or (iii) is generally not true (see 
Example 1.26 and Exercise 116). Note that part (iv) of Theorem 1.8 (Sko- 
rohod’s theorem) is not a converse of part (i), but it is an important result 
in probability theory. It is useful when we study convergence of quantities 
related to $F_{X_n}$ and $F_X$ when $X_n \to_d X$ (see, e.g., the proofs of Theorems
1.8 and 1.9). Part (v) of Theorem 1.8 indicates that the converse of part (i)
is true under the additional condition that $P(\|X_n - X\| > \epsilon)$ tends to 0 fast
enough. Part (vi) provides a partial converse of part (i) whereas part (vii) is
a partial converse of part (iii). A consequence of Theorem 1.8(viii) is that if
$X_n \to_p X$ and $\{\|X_n - X\|_r^r\}$ is uniformly integrable, then $X_n \to_{L_r} X$; i.e.,
the converse of Theorem 1.8(ii) is true under the additional condition of
uniform integrability. A useful sufficient condition for uniform integrability
of $\{\|X_n\|_r^r\}$ is that

$$\sup_n E\|X_n\|_r^{r+\delta} < \infty \qquad (1.74)$$

for a $\delta > 0$. Some other sufficient conditions are given in Exercises 117-120.


Example 1.26. Let 6, = 1+n7! and X, be a random variable having 
the exponential distribution E(0,6,) (Table 1.2), n = 1,2,.... Let X be 
a random variable having the exponential distribution (0,1). For any 
xz >0, 

Fx, (a) =1—e7*/9 1 — e-® = Fx(z) 
as n — oo. Since Fy, (x) = 0 = Fx(ax) for x < 0, we have shown that 
Xn —ad Xx. 


Is it true that X, —-, X? This question cannot be answered without any 
further information about the random variables X and X,,. We consider 
two cases in which different answers can be obtained. First, suppose that 
Xn = 6,X (then X,, has the given c.d-f.). Note that X,—X = (6, -1)X = 
n—'X, which has the c.d.f. (1 — e~"”)Ijo,o0) (a). Hence 


PURE A Se Se 0 


for any € > 0. In fact, by Theorem 1.8(v), Xn —a.s. X; since E|X,—X|? = 
n-PEX? < oo for any p > 0, Xn >1, X for any p > 0. Next, suppose 
that X,, and X are independent random variables. Using result (1.28) 
and the fact that the p.d-f.’s for X, and —X are 67 1e7*/% Ig...)(x) and 
e*I(_co,0)(#), respectively, we obtain that 


P(Xn—X1<0= ff 62e*!e¥ "0.05 (2) co) dey, 
which converges to (by the dominated convergence theorem) 


/ fete Too) coe (y)dady =1-—e~*. 


1.5. Asymptotic Theory 53 


Thus, $P(|X_n - X| \ge \epsilon) \to e^{-\epsilon} > 0$ for any $\epsilon > 0$ and, therefore, $\{X_n\}$
does not converge to $X$ in probability. The previous discussion, however,
indicates how to construct the random variables $Y_n$ and $Y$ in Theorem
1.8(iv) for this example. ∎
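
The two cases of Example 1.26 can be checked by simulation. The following is a minimal sketch, not part of the original text: the function name `tail_prob`, the sample sizes, and the seed are illustrative assumptions. It estimates $P(|X_n - X| \ge \epsilon)$ when $X_n = \theta_n X$ and when $X_n$ is drawn independently of $X$.

```python
import math
import random

def tail_prob(n, eps, coupled, reps=20000, seed=0):
    """Monte Carlo estimate of P(|X_n - X| >= eps) in Example 1.26 (illustrative helper)."""
    rng = random.Random(seed)
    theta_n = 1.0 + 1.0 / n
    hits = 0
    for _ in range(reps):
        x = rng.expovariate(1.0)                  # X ~ E(0, 1)
        if coupled:
            x_n = theta_n * x                     # case 1: X_n = theta_n * X
        else:
            x_n = theta_n * rng.expovariate(1.0)  # case 2: X_n independent of X
        hits += abs(x_n - x) >= eps
    return hits / reps

eps = 0.5
coupled_est = tail_prob(200, eps, coupled=True)   # should be essentially 0
indep_est = tail_prob(200, eps, coupled=False)    # should be near exp(-eps)
print(coupled_est, indep_est, math.exp(-eps))
```

Both sequences have the same marginal c.d.f.'s, so the difference between the two estimates reflects only the joint behavior of $(X_n, X)$.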


The following famous result is used in the proof of Theorem 1.8(v). Its 
proof is left to the reader. 


Lemma 1.5. (Borel-Cantelli lemma). Let $A_n$ be a sequence of events in a
probability space and $\limsup_n A_n = \bigcap_{n=1}^{\infty}\bigcup_{m=n}^{\infty} A_m$.

(i) If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$.

(ii) If $A_1, A_2, \ldots$ are pairwise independent and $\sum_{n=1}^{\infty} P(A_n) = \infty$, then
$P(\limsup_n A_n) = 1$. ∎


Proof of Theorem 1.8. (i) The result follows from Lemma 1.4, since
(1.71) implies (1.70).

(ii) The result follows from Chebyshev's inequality with $\varphi(t) = |t|^r$.

(iii) For any $c = (c_1, \ldots, c_k) \in R^k$, define $(-\infty, c] = (-\infty, c_1] \times \cdots \times (-\infty, c_k]$.
Let $x$ be a continuity point of $F_X$, $\epsilon > 0$ be given, and $J_k$ be the $k$-vector
of ones. Then $\{X \in (-\infty, x - \epsilon J_k],\ X_n \notin (-\infty, x]\} \subset \{\|X_n - X\| > \epsilon\}$ and

$$\begin{aligned}
F_X(x - \epsilon J_k) &= P(X \in (-\infty, x - \epsilon J_k]) \\
&\le P(X_n \in (-\infty, x]) + P(X \in (-\infty, x - \epsilon J_k],\ X_n \notin (-\infty, x]) \\
&\le F_{X_n}(x) + P(\|X_n - X\| > \epsilon).
\end{aligned}$$

Letting $n \to \infty$, we obtain that $F_X(x - \epsilon J_k) \le \liminf_n F_{X_n}(x)$. Similarly,
we can show that $F_X(x + \epsilon J_k) \ge \limsup_n F_{X_n}(x)$. Since $\epsilon$ is arbitrary and
$F_X$ is continuous at $x$, $F_X(x) = \lim_{n\to\infty} F_{X_n}(x)$.

(iv) The proof of this part can be found in Billingsley (1986, pp. 399-402).

(v) Let $A_n = \{\|X_n - X\| \ge \epsilon\}$. The result follows from Lemma 1.4, Lemma
1.5(i), and Proposition 1.1(iii).

(vi) From (1.70), for every $j = 1, 2, \ldots$, there is a positive integer $n_j$ such
that $P(\|X_{n_j} - X\| > 2^{-j}) < 2^{-j}$. For any $\epsilon > 0$, there is a $k_\epsilon$ such that for
$j \ge k_\epsilon$, $P(\|X_{n_j} - X\| > \epsilon) \le P(\|X_{n_j} - X\| > 2^{-j})$. Since $\sum_{j=1}^{\infty} 2^{-j} = 1$, it
follows from the result in (v) that $X_{n_j} \to_{a.s.} X$ as $j \to \infty$.

(vii) The proof for this part is left as an exercise.

(viii) First, by part (iv), we may assume that $X_n \to_{a.s.} X$ (why?). Assume
that $\{\|X_n\|_r^r\}$ is uniformly integrable. Then $\sup_n E\|X_n\|_r^r < \infty$ (why?)
and by Fatou's lemma (Theorem 1.1(i)), $E\|X\|_r^r \le \liminf_n E\|X_n\|_r^r < \infty$.
Hence, (1.72) follows if we can show that

$$\limsup_n E\|X_n\|_r^r \le E\|X\|_r^r. \tag{1.75}$$

For any $\epsilon > 0$ and $t > 0$, let $A_n = \{\|X_n - X\|_r \le \epsilon\}$ and $B_n = \{\|X_n\|_r > t\}$.


54 1. Probability Theory 


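Then

$$\begin{aligned}
E\|X_n\|_r^r &= E(\|X_n\|_r^r I_{A_n^c \cap B_n}) + E(\|X_n\|_r^r I_{A_n^c \cap B_n^c}) + E(\|X_n\|_r^r I_{A_n}) \\
&\le E(\|X_n\|_r^r I_{B_n}) + t^r P(A_n^c) + E\|X_n I_{A_n}\|_r^r.
\end{aligned}$$

For $r \le 1$, $\|X_n I_{A_n}\|_r^r \le (\|X_n - X\|_r^r + \|X\|_r^r) I_{A_n}$ and

$$E\|X_n I_{A_n}\|_r^r \le E[(\|X_n - X\|_r^r + \|X\|_r^r) I_{A_n}] \le \epsilon^r + E\|X\|_r^r.$$

For $r > 1$, an application of Minkowski's inequality leads to

$$\begin{aligned}
E\|X_n I_{A_n}\|_r^r &= E\|(X_n - X) I_{A_n} + X I_{A_n}\|_r^r \\
&\le E\big[\|(X_n - X) I_{A_n}\|_r + \|X I_{A_n}\|_r\big]^r \\
&\le \Big\{\big[E\|(X_n - X) I_{A_n}\|_r^r\big]^{1/r} + \big[E\|X I_{A_n}\|_r^r\big]^{1/r}\Big\}^r \\
&\le \Big\{\epsilon + \big[E\|X\|_r^r\big]^{1/r}\Big\}^r.
\end{aligned}$$

In any case, since $\epsilon$ is arbitrary, $\limsup_n E\|X_n I_{A_n}\|_r^r \le E\|X\|_r^r$. This result
and the previously established inequality imply that

$$\begin{aligned}
\limsup_n E\|X_n\|_r^r &\le \limsup_n E(\|X_n\|_r^r I_{B_n}) + t^r \lim_{n\to\infty} P(A_n^c) + \limsup_n E\|X_n I_{A_n}\|_r^r \\
&\le \sup_n E(\|X_n\|_r^r I_{\{\|X_n\|_r > t\}}) + E\|X\|_r^r,
\end{aligned}$$

since $P(A_n^c) \to 0$. Since $\{\|X_n\|_r^r\}$ is uniformly integrable, letting $t \to \infty$ we
obtain (1.75).

Suppose now that (1.72) holds. Let $\xi_n = \|X_n\|_r^r I_{B_n^c} - \|X\|_r^r I_{B_n^c}$. Then
$\xi_n \to_{a.s.} 0$ and $|\xi_n| \le t^r + \|X\|_r^r$, which is integrable. By the dominated
convergence theorem, $E\xi_n \to 0$; this and (1.72) imply that

$$E(\|X_n\|_r^r I_{B_n}) - E(\|X\|_r^r I_{B_n}) \to 0.$$

From the definition of $B_n$, $B_n \subset \{\|X_n - X\|_r > t/2\} \cup \{\|X\|_r > t/2\}$.
Since $E\|X\|_r^r < \infty$, it follows from the dominated convergence theorem
that $E(\|X\|_r^r I_{\{\|X_n - X\|_r > t/2\}}) \to 0$ as $n \to \infty$. Hence,

$$\limsup_n E(\|X_n\|_r^r I_{B_n}) \le \limsup_n E(\|X\|_r^r I_{B_n}) \le E(\|X\|_r^r I_{\{\|X\|_r > t/2\}}).$$

Letting $t \to \infty$, it follows from the dominated convergence theorem that

$$\lim_{t\to\infty} \limsup_n E(\|X_n\|_r^r I_{B_n}) \le \lim_{t\to\infty} E(\|X\|_r^r I_{\{\|X\|_r > t/2\}}) = 0.$$

This proves (1.73). ∎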




Example 1.27. As an application of Theorem 1.8(viii) and Proposition
1.15, we consider again the prediction problem in Example 1.22. Suppose
that we predict a random variable $X$ by a random $n$-vector $Y = (Y_1, \ldots, Y_n)$.
It is shown in Example 1.22 that $X_n = E(X|Y_1, \ldots, Y_n)$ is the best predictor
in terms of the mean squared prediction error, when $EX^2 < \infty$. We now
show that $X_n \to_{a.s.} X$ when $n \to \infty$ under the assumption that $\sigma(X) \subset
\mathcal{F}_\infty = \sigma(Y_1, Y_2, \ldots)$ (i.e., $X$ provides no more information than $Y_1, Y_2, \ldots$).

From the discussion in §1.4.4, $\{X_n\}$ is a martingale. Also, $\sup_n E|X_n| \le
\sup_n E[E(|X|\,|Y_1, \ldots, Y_n)] = E|X| < \infty$. Hence, by Proposition 1.15, $X_n
\to_{a.s.} Z$ for some random variable $Z$. We now need to show $Z = X$ a.s.
Since $\sigma(X) \subset \mathcal{F}_\infty$, $X = E(X|\mathcal{F}_\infty)$ a.s. Hence, it suffices to show that $Z =
E(X|\mathcal{F}_\infty)$ a.s. Since $EX_n^2 \le EX^2 < \infty$ (why?), condition (1.74) holds for
the sequence $\{|X_n|\}$ and, hence, $\{|X_n|\}$ is uniformly integrable. By Theorem
1.8(viii), $E|X_n - Z| \to 0$, which implies $\int_A X_n\,dP \to \int_A Z\,dP$ for any event
$A$. Note that if $A \in \sigma(Y_1, \ldots, Y_m)$, then $A \in \sigma(Y_1, \ldots, Y_n)$ for $n \ge m$ and
$\int_A X_n\,dP = \int_A X\,dP$. This implies that for any $A \in \bigcup_{j=1}^{\infty} \sigma(Y_1, \ldots, Y_j)$,
$\int_A X\,dP = \int_A Z\,dP$. Since $\bigcup_{j=1}^{\infty} \sigma(Y_1, \ldots, Y_j)$ generates $\mathcal{F}_\infty$, we conclude
that $\int_A X\,dP = \int_A Z\,dP$ for any $A \in \mathcal{F}_\infty$ and thus $Z = E(X|\mathcal{F}_\infty)$ a.s.

In the proof above, the condition $EX^2 < \infty$ is used only for showing the
uniform integrability of $\{|X_n|\}$. But by Exercise 120, $\{|X_n|\}$ is uniformly
integrable as long as $E|X| < \infty$. Hence $X_n \to_{a.s.} X$ is still true if the
condition $EX^2 < \infty$ is replaced by $E|X| < \infty$. ∎


We now introduce the notion of $O(\cdot)$, $o(\cdot)$, and stochastic $O(\cdot)$ and
$o(\cdot)$. In calculus, two sequences of real numbers, $\{a_n\}$ and $\{b_n\}$, satisfy
$a_n = O(b_n)$ if and only if $|a_n| \le c|b_n|$ for all $n$ and a constant $c$; and
$a_n = o(b_n)$ if and only if $a_n/b_n \to 0$ as $n \to \infty$.

Definition 1.9. Let $X_1, X_2, \ldots$ be random vectors and $Y_1, Y_2, \ldots$ be random
variables defined on a common probability space.

(i) $X_n = O(Y_n)$ a.s. if and only if $P(\|X_n\| = O(|Y_n|)) = 1$.

(ii) $X_n = o(Y_n)$ a.s. if and only if $X_n/Y_n \to_{a.s.} 0$.

(iii) $X_n = O_p(Y_n)$ if and only if, for any $\epsilon > 0$, there is a constant $C_\epsilon > 0$
such that $\sup_n P(\|X_n\| \ge C_\epsilon|Y_n|) < \epsilon$.

(iv) $X_n = o_p(Y_n)$ if and only if $X_n/Y_n \to_p 0$. ∎

Note that $X_n = o_p(Y_n)$ implies $X_n = O_p(Y_n)$; $X_n = O_p(Y_n)$ and $Y_n =
O_p(Z_n)$ implies $X_n = O_p(Z_n)$; but $X_n = O_p(Y_n)$ does not imply $Y_n =
O_p(X_n)$. The same conclusion can be obtained if $O_p(\cdot)$ and $o_p(\cdot)$ are
replaced by $O(\cdot)$ a.s. and $o(\cdot)$ a.s., respectively. Some results related to $O_p$
are given in Exercise 127. For example, if $X_n \to_d X$ for a random variable
$X$, then $X_n = O_p(1)$. Since $a_n = O(1)$ means that $\{a_n\}$ is bounded, $\{X_n\}$
is said to be bounded in probability if $X_n = O_p(1)$.


1.5.2 Weak convergence 


We now discuss more about convergence in distribution or weak convergence
of probability measures. A sequence $\{P_n\}$ of probability measures
on $(R^k, \mathcal{B}^k)$ is tight if for every $\epsilon > 0$, there is a compact set $C \subset R^k$
such that $\inf_n P_n(C) > 1 - \epsilon$. If $\{X_n\}$ is a sequence of random $k$-vectors,
then the tightness of $\{P_{X_n}\}$ is the same as the boundedness of $\{\|X_n\|\}$ in
probability. The proof of the following result can be found in Billingsley
(1986, pp. 392-395).

Proposition 1.17. Let $\{P_n\}$ be a sequence of probability measures on
$(R^k, \mathcal{B}^k)$.

(i) Tightness of $\{P_n\}$ is a necessary and sufficient condition that for every
subsequence $\{P_{n_i}\}$ there exists a further subsequence $\{P_{n_j}\} \subset \{P_{n_i}\}$ and
a probability measure $P$ on $(R^k, \mathcal{B}^k)$ such that $P_{n_j} \to_w P$ as $j \to \infty$.

(ii) If $\{P_n\}$ is tight and if each subsequence that converges weakly at all
converges to the same probability measure $P$, then $P_n \to_w P$. ∎


The following result gives some useful sufficient and necessary conditions 
for convergence in distribution. 


Theorem 1.9. Let $X, X_1, X_2, \ldots$ be random $k$-vectors.

(i) $X_n \to_d X$ is equivalent to any one of the following conditions:

(a) $E[h(X_n)] \to E[h(X)]$ for every bounded continuous function $h$;

(b) $\limsup_n P_{X_n}(C) \le P_X(C)$ for any closed set $C \subset R^k$;

(c) $\liminf_n P_{X_n}(O) \ge P_X(O)$ for any open set $O \subset R^k$.

(ii) (Lévy-Cramér continuity theorem). Let $\phi_X, \phi_{X_1}, \phi_{X_2}, \ldots$ be the ch.f.'s of
$X, X_1, X_2, \ldots$, respectively. $X_n \to_d X$ if and only if $\lim_{n\to\infty} \phi_{X_n}(t) = \phi_X(t)$
for all $t \in R^k$.

(iii) (Cramér-Wold device). $X_n \to_d X$ if and only if $c^\tau X_n \to_d c^\tau X$ for
every $c \in R^k$.

Proof. (i) First, we show $X_n \to_d X$ implies (a). By Theorem 1.8(iv)
(Skorohod's theorem), there exists a sequence of random vectors $\{Y_n\}$
and a random vector $Y$ such that $P_{Y_n} = P_{X_n}$ for all $n$, $P_Y = P_X$ and
$Y_n \to_{a.s.} Y$. For bounded continuous $h$, $h(Y_n) \to_{a.s.} h(Y)$ and, by the
dominated convergence theorem, $E[h(Y_n)] \to E[h(Y)]$. Then (a) follows
from $E[h(X_n)] = E[h(Y_n)]$ for all $n$ and $E[h(X)] = E[h(Y)]$.

Next, we show (a) implies (b). Let $C$ be a closed set and $f_C(x) =
\inf\{\|x - y\| : y \in C\}$. Then $f_C$ is continuous. For $j = 1, 2, \ldots$, define
$\varphi_j(t) = I_{(-\infty,0]}(t) + (1 - jt)I_{(0,j^{-1}]}(t)$. Then $h_j(x) = \varphi_j(f_C(x))$ is continuous
and bounded, $h_j \ge h_{j+1}$, $j = 1, 2, \ldots$, and $h_j(x) \to I_C(x)$ as $j \to \infty$. Hence
$\limsup_n P_{X_n}(C) \le \lim_{n\to\infty} E[h_j(X_n)] = E[h_j(X)]$ for each $j$ (by (a)).
By the dominated convergence theorem, $E[h_j(X)] \to E[I_C(X)] = P_X(C)$.




This proves (b).

For any open set $O$, $O^c$ is closed. Hence, (b) is equivalent to (c). Now, we
show (b) and (c) imply $X_n \to_d X$. For $x = (x_1, \ldots, x_k) \in R^k$, let $(-\infty, x] =
(-\infty, x_1] \times \cdots \times (-\infty, x_k]$ and $(-\infty, x) = (-\infty, x_1) \times \cdots \times (-\infty, x_k)$. From
(b) and (c), $P_X((-\infty, x)) \le \liminf_n P_{X_n}((-\infty, x)) \le \liminf_n F_{X_n}(x) \le
\limsup_n F_{X_n}(x) = \limsup_n P_{X_n}((-\infty, x]) \le P_X((-\infty, x]) = F_X(x)$. If
$x$ is a continuity point of $F_X$, then $P_X((-\infty, x)) = F_X(x)$. This proves
$X_n \to_d X$ and completes the proof of (i).

(ii) From (a) of part (i), $X_n \to_d X$ implies $\phi_{X_n}(t) \to \phi_X(t)$, since $e^{\sqrt{-1}\,t^\tau x} =
\cos(t^\tau x) + \sqrt{-1}\sin(t^\tau x)$ and $\cos(t^\tau x)$ and $\sin(t^\tau x)$ are bounded continuous
functions for any fixed $t$.

Suppose now that $k = 1$ and that $\phi_{X_n}(t) \to \phi_X(t)$ for every $t \in R$. By
Fubini's theorem,

$$\begin{aligned}
\frac{1}{u}\int_{-u}^{u} [1 - \phi_{X_n}(t)]\,dt &= \int_{-\infty}^{\infty} \left[\frac{1}{u}\int_{-u}^{u} (1 - e^{\sqrt{-1}\,tx})\,dt\right] dP_{X_n}(x) \\
&= 2\int_{-\infty}^{\infty} \left(1 - \frac{\sin ux}{ux}\right) dP_{X_n}(x) \\
&\ge 2\int_{\{|x| > 2u^{-1}\}} \left(1 - \frac{1}{|ux|}\right) dP_{X_n}(x) \\
&\ge P_{X_n}\big((-\infty, -2u^{-1}) \cup (2u^{-1}, \infty)\big)
\end{aligned}$$

for any $u > 0$. Since $\phi_X$ is continuous at 0 and $\phi_X(0) = 1$, for any $\epsilon > 0$
there is a $u > 0$ such that $u^{-1}\int_{-u}^{u} [1 - \phi_X(t)]\,dt < \epsilon/2$. Since $\phi_{X_n} \to \phi_X$,
by the dominated convergence theorem, $\sup_n \{u^{-1}\int_{-u}^{u} [1 - \phi_{X_n}(t)]\,dt\} < \epsilon$.
Hence,

$$\inf_n P_{X_n}\big([-2u^{-1}, 2u^{-1}]\big) \ge 1 - \sup_n \left\{\frac{1}{u}\int_{-u}^{u} [1 - \phi_{X_n}(t)]\,dt\right\} \ge 1 - \epsilon,$$

i.e., $\{P_{X_n}\}$ is tight. Let $\{P_{X_{n_j}}\}$ be any subsequence that converges to a
probability measure $P$. By the first part of the proof, $\phi_{X_{n_j}} \to \phi$, which is
the ch.f. of $P$. By the convergence of $\phi_{X_n}$, $\phi = \phi_X$. By Theorem 1.6(i),
$P = P_X$. By Proposition 1.17(ii), $X_n \to_d X$.

Consider now the case where $k \ge 2$ and $\phi_{X_n} \to \phi_X$. Let $Y_{nj}$ be the $j$th
component of $X_n$ and $Y_j$ be the $j$th component of $X$. Then $\phi_{Y_{nj}} \to \phi_{Y_j}$
for each $j$. By the proof for the case of $k = 1$, $Y_{nj} \to_d Y_j$. By Proposition
1.17(i), $\{P_{Y_{nj}}\}$ is tight, $j = 1, \ldots, k$. This implies that $\{P_{X_n}\}$ is tight (why?).
Then the proof for $X_n \to_d X$ is the same as that for the case of $k = 1$.

(iii) From (1.52), $\phi_{c^\tau X_n}(u) = \phi_{X_n}(uc)$ and $\phi_{c^\tau X}(u) = \phi_X(uc)$ for any
$u \in R$ and any $c \in R^k$. Hence, convergence of $\phi_{X_n}$ to $\phi_X$ is equivalent to
convergence of $\phi_{c^\tau X_n}$ to $\phi_{c^\tau X}$ for every $c \in R^k$. Then the result follows
from part (ii). ∎




Example 1.28. Let $X_1, \ldots, X_n$ be independent random variables having
a common c.d.f. and $T_n = X_1 + \cdots + X_n$, $n = 1, 2, \ldots$. Suppose that
$E|X_1| < \infty$. It follows from (1.56) and a result in calculus that the ch.f. of
$X_1$ satisfies

$$\phi_{X_1}(t) = \phi_{X_1}(0) + \sqrt{-1}\,\mu t + o(|t|)$$

as $|t| \to 0$, where $\mu = EX_1$. From (1.52) and (1.58), the ch.f. of $T_n/n$ is

$$\phi_{T_n/n}(t) = \left[\phi_{X_1}\!\left(\frac{t}{n}\right)\right]^n = \left[1 + \frac{\sqrt{-1}\,\mu t}{n} + o\!\left(\frac{t}{n}\right)\right]^n$$

for any $t \in R$, as $n \to \infty$. Since $(1 + c_n/n)^n \to e^c$ for any complex sequence
$\{c_n\}$ satisfying $c_n \to c$, we obtain that $\phi_{T_n/n}(t) \to e^{\sqrt{-1}\,\mu t}$, which is the
ch.f. of the distribution degenerated at $\mu$ (i.e., the point mass probability
measure at $\mu$; see (1.22)). By Theorem 1.9(ii), $T_n/n \to_d \mu$. From Theorem
1.8(vii), this also shows that $T_n/n \to_p \mu$.

Similarly, $\mu = 0$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$ imply

$$\phi_{T_n/\sqrt{n}}(t) = \left[1 - \frac{\sigma^2 t^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]^n$$

for any $t \in R$, which implies that $\phi_{T_n/\sqrt{n}}(t) \to e^{-\sigma^2 t^2/2}$, the ch.f. of
$N(0, \sigma^2)$. Hence $T_n/\sqrt{n} \to_d N(0, \sigma^2)$. (Recall that $N(\mu, \sigma^2)$ denotes a
random variable having the $N(\mu, \sigma^2)$ distribution.) If $\mu \ne 0$, a transformation
of $Y_i = X_i - \mu$ leads to $(T_n - n\mu)/\sqrt{n} \to_d N(0, \sigma^2)$.

Suppose now that $X_1, \ldots, X_n$ are random $k$-vectors and $\mu = EX_1$ and
$\Sigma = \mathrm{Var}(X_1)$ are finite. For any fixed $c \in R^k$, it follows from the previous
discussion that $(c^\tau T_n - nc^\tau \mu)/\sqrt{n} \to_d N(0, c^\tau \Sigma c)$. From Theorem 1.9(iii)
and a property of the normal distribution (Exercise 81), we conclude that
$(T_n - n\mu)/\sqrt{n} \to_d N_k(0, \Sigma)$. ∎
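
The first limit in Example 1.28 can be checked numerically in a case where the ch.f. is known in closed form. The sketch below is an illustration under an added assumption: the $X_i$ are taken to be exponential $E(0,1)$, whose ch.f. is $1/(1 - \sqrt{-1}\,t)$ and whose mean is $\mu = 1$. The function name `chf_mean_exponential` is hypothetical.

```python
import cmath

def chf_mean_exponential(t, n):
    """ch.f. of T_n/n for i.i.d. E(0,1) variables: [phi(t/n)]^n with phi(s) = 1/(1 - i s)."""
    return (1.0 / (1.0 - 1j * t / n)) ** n

t = 2.0
mu = 1.0                                # mean of E(0, 1) (assumed distribution)
limit = cmath.exp(1j * mu * t)          # ch.f. of the point mass at mu
errors = [abs(chf_mean_exponential(t, n) - limit) for n in (10, 100, 1000, 10000)]
print(errors)                           # the gaps shrink roughly like 1/n
```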


Example 1.28 shows that Theorem 1.9(ii) together with some properties 
of ch.f.’s can be applied to show convergence in distribution for sums of 
independent random variables (vectors). The following is another example. 


Example 1.29. Let $X_1, \ldots, X_n$ be independent random variables having a
common Lebesgue p.d.f. $f(x) = (1 - \cos x)/(\pi x^2)$. Then the ch.f. of $X_1$ is
$\max\{1 - |t|, 0\}$ (Exercise 73) and the ch.f. of $T_n/n = (X_1 + \cdots + X_n)/n$ is

$$\left(\max\left\{1 - \frac{|t|}{n},\, 0\right\}\right)^n \to e^{-|t|}, \qquad t \in R.$$

Since $e^{-|t|}$ is the ch.f. of the Cauchy distribution $C(0,1)$ (Table 1.2), we
conclude that $T_n/n \to_d X$, where $X$ has the Cauchy distribution $C(0,1)$.

Does this result contradict the first result in Example 1.28? ∎
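
The convergence of the ch.f. in Example 1.29 is easy to inspect numerically. The following short sketch (the function name `chf_mean_triangular` and the chosen grid of $t$ values are illustrative assumptions) compares the $n$th power of the triangular ch.f. with $e^{-|t|}$.

```python
import math

def chf_mean_triangular(t, n):
    """ch.f. of T_n/n in Example 1.29: (max{1 - |t|/n, 0})^n."""
    return max(1.0 - abs(t) / n, 0.0) ** n

ts = [0.5, 1.0, 3.0]
# Largest gap to the Cauchy C(0,1) ch.f. over the grid, for increasing n.
gaps = {
    n: max(abs(chf_mean_triangular(t, n) - math.exp(-abs(t))) for t in ts)
    for n in (5, 50, 500, 5000)
}
print(gaps)
```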




Other examples of applications of Theorem 1.9 are given in Exercises
135-140 in §1.6. The following result can be used to check whether $X_n \to_d
X$ when $X$ has a p.d.f. $f$ and $X_n$ has a p.d.f. $f_n$.

Proposition 1.18 (Scheffé's theorem). Let $\{f_n\}$ be a sequence of p.d.f.'s
on $R^k$ w.r.t. a measure $\nu$. Suppose that $\lim_{n\to\infty} f_n(x) = f(x)$ a.e. $\nu$ and
$f(x)$ is a p.d.f. w.r.t. $\nu$. Then $\lim_{n\to\infty} \int |f_n(x) - f(x)|\,d\nu = 0$.

Proof. Let $g_n(x) = [f(x) - f_n(x)]I_{\{f \ge f_n\}}(x)$, $n = 1, 2, \ldots$. Then

$$\int |f_n(x) - f(x)|\,d\nu = 2\int g_n(x)\,d\nu.$$

Since $0 \le g_n(x) \le f(x)$ for all $x$ and $g_n \to 0$ a.e. $\nu$, the result follows from
the dominated convergence theorem. ∎

As an example, consider the Lebesgue p.d.f. $f_n$ of the t-distribution $t_n$
(Table 1.2), $n = 1, 2, \ldots$. One can show (exercise) that $f_n \to f$, where $f$ is
the standard normal p.d.f. This is an important result in statistics.
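
The conclusion of Scheffé's theorem for the t-distribution can be illustrated numerically. The sketch below is not from the text: it approximates $\int |f_n(x) - f(x)|\,dx$ by the trapezoidal rule on the truncated interval $[-60, 60]$ (an assumption that slightly understates the distance for small $n$, where the tails are heavy); the helper names are hypothetical.

```python
import math

def t_pdf(x, n):
    """Lebesgue p.d.f. of the t-distribution with n degrees of freedom."""
    log_c = math.lgamma((n + 1) / 2) - math.lgamma(n / 2) - 0.5 * math.log(n * math.pi)
    return math.exp(log_c - (n + 1) / 2 * math.log1p(x * x / n))

def normal_pdf(x):
    """Standard normal p.d.f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def l1_distance(n, half_width=60.0, steps=24000):
    """Trapezoidal approximation of the integral of |f_n - f| over [-half_width, half_width]."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * abs(t_pdf(x, n) - normal_pdf(x))
    return total * h

distances = [l1_distance(n) for n in (1, 5, 25, 125)]
print(distances)  # decreasing toward 0 as n grows
```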


1.5.3 Convergence of transformations 


Transformation is an important tool in statistics. For random vectors $X_n$
converging to $X$ in some sense, we often want to know whether $g(X_n)$
converges to $g(X)$ in the same sense. The following result provides an
answer to this question in many problems. Its proof is left to the reader.

Theorem 1.10. Let $X, X_1, X_2, \ldots$ be random $k$-vectors defined on a probability
space and $g$ be a measurable function from $(R^k, \mathcal{B}^k)$ to $(R^l, \mathcal{B}^l)$.
Suppose that $g$ is continuous a.s. $P_X$. Then

(i) $X_n \to_{a.s.} X$ implies $g(X_n) \to_{a.s.} g(X)$;

(ii) $X_n \to_p X$ implies $g(X_n) \to_p g(X)$;

(iii) $X_n \to_d X$ implies $g(X_n) \to_d g(X)$. ∎

Example 1.30. (i) Let $X_1, X_2, \ldots$ be random variables. If $X_n \to_d X$,
where $X$ has the $N(0,1)$ distribution, then $X_n^2 \to_d Y$, where $Y$ has the
chi-square distribution $\chi_1^2$ (Example 1.14).

(ii) Let $(X_n, Y_n)$ be random 2-vectors satisfying $(X_n, Y_n) \to_d (X, Y)$, where
$X$ and $Y$ are independent random variables having the $N(0,1)$ distribution,
then $X_n/Y_n \to_d X/Y$, which has the Cauchy distribution $C(0,1)$ (§1.3.1).

(iii) Under the conditions in part (ii), $\max\{X_n, Y_n\} \to_d \max\{X, Y\}$, which
has the c.d.f. $[\Phi(x)]^2$ ($\Phi(x)$ is the c.d.f. of $N(0,1)$). ∎

In Example 1.30(ii) and (iii), the condition that $(X_n, Y_n) \to_d (X, Y)$
cannot be relaxed to $X_n \to_d X$ and $Y_n \to_d Y$ (exercise); i.e., we need the




convergence of the joint c.d.f. of $(X_n, Y_n)$. This is different when $\to_d$ is
replaced by $\to_p$ or $\to_{a.s.}$. The following result, which plays an important role
in probability and statistics, establishes the convergence in distribution of
$X_n + Y_n$ or $X_n Y_n$ when no information regarding the joint c.d.f. of $(X_n, Y_n)$
is provided.


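Theorem 1.11 (Slutsky's theorem). Let $X, X_1, X_2, \ldots, Y_1, Y_2, \ldots$ be random
variables on a probability space. Suppose that $X_n \to_d X$ and $Y_n \to_p c$,
where $c$ is a fixed real number. Then

(i) $X_n + Y_n \to_d X + c$;

(ii) $Y_n X_n \to_d cX$;

(iii) $X_n/Y_n \to_d X/c$ if $c \ne 0$.

Proof. We prove (i) only. The proofs of (ii) and (iii) are left as exercises.
Let $t \in R$ and $\epsilon > 0$ be fixed constants. Then

$$\begin{aligned}
F_{X_n + Y_n}(t) &= P(X_n + Y_n \le t) \\
&\le P(\{X_n + Y_n \le t\} \cap \{|Y_n - c| < \epsilon\}) + P(|Y_n - c| \ge \epsilon) \\
&\le P(X_n \le t - c + \epsilon) + P(|Y_n - c| \ge \epsilon)
\end{aligned}$$

and, similarly,

$$F_{X_n + Y_n}(t) \ge P(X_n \le t - c - \epsilon) - P(|Y_n - c| \ge \epsilon).$$

If $t - c$, $t - c + \epsilon$, and $t - c - \epsilon$ are continuity points of $F_X$, then it follows
from the previous two inequalities and the hypotheses of the theorem that

$$F_X(t - c - \epsilon) \le \liminf_n F_{X_n + Y_n}(t) \le \limsup_n F_{X_n + Y_n}(t) \le F_X(t - c + \epsilon).$$

Since $\epsilon$ can be arbitrary (why?),

$$\lim_{n\to\infty} F_{X_n + Y_n}(t) = F_X(t - c).$$

The result follows from $F_{X+c}(t) = F_X(t - c)$. ∎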
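
A standard use of Slutsky's theorem is the studentized mean: if $\sqrt{n}(\bar{X} - \mu)/\sigma \to_d N(0,1)$ and the sample standard deviation $S \to_p \sigma$, then part (iii) gives $\sqrt{n}(\bar{X} - \mu)/S \to_d N(0,1)$. The simulation below is a minimal sketch of this, with the exponential $E(0,1)$ population, the sample sizes, and the seed chosen purely for illustration.

```python
import math
import random

def studentized_mean(n, rng):
    """sqrt(n) * (sample mean - 1) / sample s.d. for n draws from E(0,1), whose mean is 1."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (mean - 1.0) / math.sqrt(var)

rng = random.Random(1)
n, reps = 400, 2000
stats = [studentized_mean(n, rng) for _ in range(reps)]

# Under the N(0,1) limit, about 95% of the statistics fall in [-1.96, 1.96].
coverage = sum(abs(t) <= 1.96 for t in stats) / reps
print(coverage)
```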


An application of Theorem 1.11 is given in the proof of the following 
important result. 


Theorem 1.12. Let $X_1, X_2, \ldots$ and $Y$ be random $k$-vectors satisfying

$$a_n(X_n - c) \to_d Y, \tag{1.76}$$

where $c \in R^k$ and $\{a_n\}$ is a sequence of positive numbers with $\lim_{n\to\infty} a_n =
\infty$. Let $g$ be a function from $R^k$ to $R$.

(i) If $g$ is differentiable at $c$, then

$$a_n[g(X_n) - g(c)] \to_d [\nabla g(c)]^\tau Y, \tag{1.77}$$




where $\nabla g(x)$ denotes the $k$-vector of partial derivatives of $g$ at $x$.

(ii) Suppose that $g$ has continuous partial derivatives of order $m > 1$ in a
neighborhood of $c$, with all the partial derivatives of order $j$, $1 \le j \le m - 1$,
vanishing at $c$, but with the $m$th-order partial derivatives not all vanishing
at $c$. Then

$$a_n^m[g(X_n) - g(c)] \to_d \frac{1}{m!}\sum_{i_1=1}^{k}\cdots\sum_{i_m=1}^{k} \left.\frac{\partial^m g}{\partial x_{i_1}\cdots\partial x_{i_m}}\right|_{x=c} Y_{i_1}\cdots Y_{i_m}, \tag{1.78}$$

where $Y_j$ is the $j$th component of $Y$.

Proof. We prove (i) only. The proof of (ii) is similar. Let

$$Z_n = a_n[g(X_n) - g(c)] - a_n[\nabla g(c)]^\tau (X_n - c).$$

If we can show that $Z_n = o_p(1)$, then by (1.76), Theorem 1.9(iii), and
Theorem 1.11(i), result (1.77) holds.

The differentiability of $g$ at $c$ implies that for any $\epsilon > 0$, there is a $\delta_\epsilon > 0$
such that

$$|g(x) - g(c) - [\nabla g(c)]^\tau (x - c)| \le \epsilon\|x - c\| \tag{1.79}$$

whenever $\|x - c\| < \delta_\epsilon$. Let $\eta > 0$ be fixed. By (1.79),

$$P(|Z_n| \ge \eta) \le P(\|X_n - c\| \ge \delta_\epsilon) + P(a_n\|X_n - c\| \ge \eta/\epsilon).$$

Since $a_n \to \infty$, (1.76) and Theorem 1.11(ii) imply $X_n \to_p c$. By Theorem
1.10(iii), (1.76) implies $a_n\|X_n - c\| \to_d \|Y\|$. Without loss of generality, we
can assume that $\eta/\epsilon$ is a continuity point of $F_{\|Y\|}$. Then

$$\begin{aligned}
\limsup_n P(|Z_n| \ge \eta) &\le \lim_{n\to\infty} P(\|X_n - c\| \ge \delta_\epsilon) + \lim_{n\to\infty} P(a_n\|X_n - c\| \ge \eta/\epsilon) \\
&= P(\|Y\| \ge \eta/\epsilon).
\end{aligned}$$

The proof is complete since $\epsilon$ can be arbitrary. ∎

In statistics, we often need a nondegenerated limiting distribution of
$a_n[g(X_n) - g(c)]$ so that probabilities involving $a_n[g(X_n) - g(c)]$ can be
approximated by the c.d.f. of $[\nabla g(c)]^\tau Y$, if (1.77) holds. Hence, result
(1.77) is not useful for this purpose if $\nabla g(c) = 0$, and in such cases result
(1.78) may be applied.

A useful method in statistics, called the delta-method, is based on the
following corollary of Theorem 1.12.

Corollary 1.1. Assume the conditions of Theorem 1.12. If $Y$ has the
$N_k(0, \Sigma)$ distribution, then

$$a_n[g(X_n) - g(c)] \to_d N\big(0, [\nabla g(c)]^\tau \Sigma \nabla g(c)\big).$$ ∎




Example 1.31. Let $\{X_n\}$ be a sequence of random variables satisfying
$\sqrt{n}(X_n - c) \to_d N(0,1)$. Consider the function $g(x) = x^2$. If $c \ne 0$, then an
application of Corollary 1.1 gives that $\sqrt{n}(X_n^2 - c^2) \to_d N(0, 4c^2)$. If $c = 0$,
the first-order derivative of $g$ at 0 is 0 but the second-order derivative of
$g \equiv 2$. Hence, an application of result (1.78) gives that $nX_n^2 \to_d [N(0,1)]^2$,
which has the chi-square distribution $\chi_1^2$ (Example 1.14). The last result
can also be obtained by applying Theorem 1.10(iii). ∎
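
Both cases of Example 1.31 can be checked by simulation. The sketch below makes a simplifying assumption not in the text: $X_n$ is drawn exactly as $N(c, 1/n)$, so that $\sqrt{n}(X_n - c)$ is exactly $N(0,1)$. It then compares the sample variance of $\sqrt{n}(X_n^2 - c^2)$ with $4c^2$ for $c = 2$, and the sample mean of $nX_n^2$ with the mean 1 of $\chi_1^2$ for $c = 0$. All names, sample sizes, and the seed are illustrative.

```python
import math
import random

def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(2)
n, reps = 400, 20000

# Case c != 0: first-order delta method, limiting variance 4 c^2.
c = 2.0
scaled = []
for _ in range(reps):
    x_n = rng.gauss(c, 1.0 / math.sqrt(n))
    scaled.append(math.sqrt(n) * (x_n ** 2 - c ** 2))
var_nonzero = sample_variance(scaled)

# Case c = 0: second-order result, n X_n^2 behaves like chi-square with 1 d.f.
squares = [n * rng.gauss(0.0, 1.0 / math.sqrt(n)) ** 2 for _ in range(reps)]
mean_zero_case = sum(squares) / reps

print(var_nonzero, 4 * c ** 2, mean_zero_case)
```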


1.5.4 The law of large numbers 


The law of large numbers concerns the limiting behavior of sums of indepen- 
dent random variables. The weak law of large numbers (WLLN) refers to 
convergence in probability, whereas the strong law of large numbers (SLLN) 
refers to a.s. convergence. 

The following lemma is useful in establishing the SLLN. Its proof is left 
as an exercise. 


Lemma 1.6. (Kronecker's lemma). Let $x_n \in R$, $a_n \in R$, $0 < a_n \le
a_{n+1}$, $n = 1, 2, \ldots$, and $a_n \to \infty$. If the series $\sum_{n=1}^{\infty} x_n/a_n$ converges, then
$a_n^{-1}\sum_{i=1}^{n} x_i \to 0$. ∎
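
Kronecker's lemma can be illustrated with a deterministic sequence. In the sketch below (an illustrative choice, not from the text), $x_i = (-1)^i\sqrt{i}$ and $a_i = i$, so that $\sum_i x_i/a_i = \sum_i (-1)^i/\sqrt{i}$ converges as an alternating series, and the lemma predicts $n^{-1}\sum_{i=1}^{n} x_i \to 0$ even though the $x_i$ are unbounded.

```python
import math

def kronecker_average(n):
    """Return (1/n) * sum_{i=1}^n x_i with x_i = (-1)^i * sqrt(i) and a_i = i."""
    total = 0.0
    for i in range(1, n + 1):
        total += (-1) ** i * math.sqrt(i)
    return total / n

averages = [kronecker_average(n) for n in (10**2, 10**3, 10**4, 10**5)]
print(averages)  # magnitudes shrink toward 0
```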


Our first result gives the WLLN and SLLN for a sequence of independent 
and identically distributed (i.i.d.) random variables. 


Theorem 1.13. Let $X_1, X_2, \ldots$ be i.i.d. random variables.

(i) (The WLLN). A necessary and sufficient condition for the existence of
a sequence of real numbers $\{a_n\}$ for which

$$\frac{1}{n}\sum_{i=1}^{n} X_i - a_n \to_p 0 \tag{1.80}$$

is that $nP(|X_1| > n) \to 0$, in which case we may take $a_n = E(X_1 I_{\{|X_1| \le n\}})$.

(ii) (The SLLN). A necessary and sufficient condition for the existence of a
constant $c$ for which

$$\frac{1}{n}\sum_{i=1}^{n} X_i \to_{a.s.} c \tag{1.81}$$

is that $E|X_1| < \infty$, in which case $c = EX_1$ and

$$\frac{1}{n}\sum_{i=1}^{n} c_i(X_i - EX_1) \to_{a.s.} 0 \tag{1.82}$$

for any bounded sequence of real numbers $\{c_i\}$.




Proof. (i) We prove the sufficiency. The proof of necessity can be found
in Petrov (1975). Consider a sequence of random variables obtained by
truncating $X_j$'s at $n$: $Y_{nj} = X_j I_{\{|X_j| \le n\}}$. Let $T_n = X_1 + \cdots + X_n$ and
$Z_n = Y_{n1} + \cdots + Y_{nn}$. Then

$$P(T_n \ne Z_n) \le \sum_{j=1}^{n} P(Y_{nj} \ne X_j) = nP(|X_1| > n) \to 0. \tag{1.83}$$

For any $\epsilon > 0$, it follows from Chebyshev's inequality that

$$P\left(\left|\frac{Z_n - EZ_n}{n}\right| > \epsilon\right) \le \frac{\mathrm{Var}(Z_n)}{\epsilon^2 n^2} = \frac{\mathrm{Var}(Y_{n1})}{\epsilon^2 n} \le \frac{EY_{n1}^2}{\epsilon^2 n},$$

where the last equality follows from the fact that $Y_{nj}$, $j = 1, \ldots, n$, are i.i.d.
From integration by parts, we obtain that

$$\frac{EY_{n1}^2}{n} = \frac{1}{n}\int_{[0,n]} x^2\,dF_{|X_1|}(x) = \frac{2}{n}\int_0^n xP(|X_1| > x)\,dx - nP(|X_1| > n),$$

which converges to 0 since $nP(|X_1| > n) \to 0$ (why?). This proves that
$(Z_n - EZ_n)/n \to_p 0$, which together with (1.83) and the fact that $EY_{nj} =
E(X_1 I_{\{|X_1| \le n\}})$ imply the result.

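(ii) For the sufficiency, let $Y_n = X_n I_{\{|X_n| \le n\}}$, $n = 1, 2, \ldots$. Let $m > 0$ be an
integer smaller than $n$. If we define $c_i = i^{-1}$ for $i \ge m$, $Z_1 = \cdots = Z_{m-1} =
0$, $Z_m = Y_1 + \cdots + Y_m$, $Z_i = Y_i$, $i = m+1, \ldots, n$, and apply the Hájek-Rényi
inequality (1.51) to $Z_i$'s, then we obtain that for any $\epsilon > 0$,

$$P\left(\max_{m \le l \le n} |\xi_l| > \epsilon\right) \le \frac{1}{\epsilon^2 m^2}\sum_{i=1}^{m} \mathrm{Var}(Y_i) + \frac{1}{\epsilon^2}\sum_{i=m+1}^{n} \frac{\mathrm{Var}(Y_i)}{i^2}, \tag{1.84}$$

where $\xi_n = n^{-1}\sum_{i=1}^{n}(Z_i - EZ_i)$ ($= n^{-1}\sum_{i=1}^{n}(Y_i - EY_i)$ if $n \ge m$). Note
that

$$\begin{aligned}
\sum_{n=1}^{\infty} \frac{EY_n^2}{n^2} &= \sum_{n=1}^{\infty}\sum_{j=1}^{n} \frac{E(X_1^2 I_{\{j-1 < |X_1| \le j\}})}{n^2} \\
&= \sum_{j=1}^{\infty}\sum_{n=j}^{\infty} \frac{E(X_1^2 I_{\{j-1 < |X_1| \le j\}})}{n^2} \\
&\le \sum_{j=1}^{\infty}\sum_{n=j}^{\infty} \frac{jE(|X_1| I_{\{j-1 < |X_1| \le j\}})}{n^2} \\
&\le \lambda\sum_{j=1}^{\infty} E(|X_1| I_{\{j-1 < |X_1| \le j\}}) \\
&= \lambda E|X_1|,
\end{aligned}$$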




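where the last inequality follows from the fact that $\sum_{n=j}^{\infty} n^{-2} \le \lambda j^{-1}$ for a
constant $\lambda > 0$ and all $j = 1, 2, \ldots$. Then, letting $n \to \infty$ first and $m \to \infty$
next in (1.84), we obtain that

$$\begin{aligned}
\lim_{m\to\infty} P\left(\bigcup_{l=m}^{\infty} \{|\xi_l| > \epsilon\}\right) &= \lim_{m\to\infty}\lim_{n\to\infty} P\left(\max_{m \le l \le n} |\xi_l| > \epsilon\right) \\
&\le \lim_{m\to\infty} \frac{1}{\epsilon^2 m^2}\sum_{i=1}^{m} \mathrm{Var}(Y_i) \\
&= 0,
\end{aligned}$$

where the last equality follows from Lemma 1.6. By Lemma 1.4, $\xi_n \to_{a.s.} 0$.
Since $EY_n \to EX_1$, $n^{-1}\sum_{i=1}^{n} EY_i \to EX_1$ and, hence, (1.81) holds with
$X_i$'s replaced by $Y_i$'s and $c = EX_1$. It follows from

$$\sum_{n=1}^{\infty} P(X_n \ne Y_n) = \sum_{n=1}^{\infty} P(|X_n| > n) = \sum_{n=1}^{\infty} P(|X_1| > n) < \infty$$

(Exercise 54) and Lemma 1.5(i) that $P\left(\bigcap_{n=1}^{\infty}\bigcup_{m=n}^{\infty} \{X_m \ne Y_m\}\right) = 0$, i.e.,
there is an event $A$ with $P(A) = 1$ such that if $\omega \in A$, then $X_n(\omega) = Y_n(\omega)$
for sufficiently large $n$. This implies

$$\frac{T_n}{n} - \frac{1}{n}\sum_{i=1}^{n} Y_i \to_{a.s.} 0, \tag{1.85}$$

which proves the sufficiency. The proof of (1.82) is left as an exercise.

We now prove the necessity. Suppose that (1.81) holds for some $c \in R$.
Then

$$\frac{X_n}{n} = \frac{T_n}{n} - c - \frac{n-1}{n}\left(\frac{T_{n-1}}{n-1} - c\right) + \frac{c}{n} \to_{a.s.} 0.$$

From Exercise 114, $X_n/n \to_{a.s.} 0$ and the i.i.d. assumption on $X_n$'s imply

$$\sum_{n=1}^{\infty} P(|X_n| \ge n) = \sum_{n=1}^{\infty} P(|X_1| \ge n) < \infty,$$

which implies $E|X_1| < \infty$ (Exercise 54). From the proved sufficiency, $c =
EX_1$. ∎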


If $E|X_1| < \infty$, then $a_n$ in (1.80) converges to $EX_1$ and result (1.80) is
actually established in Example 1.28 in a much simpler way. On the other
hand, if $E|X_1| < \infty$, then the stronger result (1.81) can be obtained. Some
results for the case of $E|X_1| = \infty$ can be found in Exercise 148 in §1.6 and
Theorem 5.4.3 in Chung (1974).


The next result is for sequences of independent but not necessarily iden- 
tically distributed random variables. 




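Theorem 1.14. Let $X_1, X_2, \ldots$ be independent random variables with
finite expectations.

(i) (The SLLN). If there is a constant $p \in [1,2]$ such that

$$\sum_{i=1}^{\infty} \frac{E|X_i|^p}{i^p} < \infty, \tag{1.86}$$

then

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - EX_i) \to_{a.s.} 0. \tag{1.87}$$

(ii) (The WLLN). If there is a constant $p \in [1,2]$ such that

$$\lim_{n\to\infty} \frac{1}{n^p}\sum_{i=1}^{n} E|X_i|^p = 0, \tag{1.88}$$

then

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - EX_i) \to_p 0. \tag{1.89}$$

Proof. (i) Consider again the truncated $X_n$: $Y_n = X_n I_{\{|X_n| \le n\}}$, $n =
1, 2, \ldots$. Since $X_n^2 I_{\{|X_n| \le n\}} \le n^{2-p}|X_n|^p$,

$$\sum_{n=1}^{\infty} \frac{EY_n^2}{n^2} = \sum_{n=1}^{\infty} \frac{E(X_n^2 I_{\{|X_n| \le n\}})}{n^2} \le \sum_{n=1}^{\infty} \frac{E|X_n|^p}{n^p} < \infty.$$

It follows from the proof of Theorem 1.13(ii) that $n^{-1}\sum_{i=1}^{n} (Y_i - EY_i) \to_{a.s.}
0$. Also,

$$\sum_{n=1}^{\infty} P(X_n \ne Y_n) = \sum_{n=1}^{\infty} P(|X_n| > n) \le \sum_{n=1}^{\infty} \frac{E|X_n|^p}{n^p} < \infty.$$

Hence, it follows from the proof of Theorem 1.13(ii) that (1.85) holds.
Finally,

$$\sum_{n=1}^{\infty} \frac{|E(X_n - Y_n)|}{n} \le \sum_{n=1}^{\infty} \frac{E(|X_n| I_{\{|X_n| > n\}})}{n} \le \sum_{n=1}^{\infty} \frac{E|X_n|^p}{n^p} < \infty,$$

which together with Lemma 1.6 imply that $n^{-1}\sum_{i=1}^{n} |E(X_i - Y_i)| \to 0$ and
thus (1.87) holds.

(ii) For any $\epsilon > 0$, an application of Chebyshev's inequality and inequality
(1.44) leads to

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} (X_i - EX_i)\right| > \epsilon\right) \le \frac{C_p}{\epsilon^p n^p}\sum_{i=1}^{n} E|X_i - EX_i|^p,$$

which converges to 0 under (1.88). This proves (1.89). ∎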




Note that (1.86) implies (1.88) (Lemma 1.6). The result in Theorem
1.14(i) is called Kolmogorov's SLLN when $p = 2$ and is due to Marcinkiewicz
and Zygmund when $1 \le p < 2$. An obvious sufficient condition for (1.86)
with $p \in (1, 2]$ is $\sup_n E|X_n|^p < \infty$.

For dependent random variables, a result for Markov chains introduced
in §1.4.4 is discussed in §4.1.4. We now consider martingales studied in
§1.4.4. First, consider the WLLN. Inequality (1.44) still holds if the independence
assumption of $X_i$'s is replaced by the martingale assumption on
the sequence $\{\sum_{i=1}^{n}(X_i - EX_i)\}$ (why?). Hence, from the proof of Theorem
1.14(ii) we conclude that (1.89) still holds if the independence assumption
of $X_i$'s in Theorem 1.14 is replaced by that $\{\sum_{i=1}^{n}(X_i - EX_i)\}$ is a martingale.
A result similar to the SLLN in Theorem 1.14(i) can be established
if the independence assumption of $X_i$'s is replaced by that the sequence
$\{\sum_{i=1}^{n}(X_i - EX_i)\}$ is a martingale and if condition (1.86) is replaced by

$$\sum_{n=2}^{\infty} \frac{E(|X_n|^p \mid X_1, \ldots, X_{n-1})}{n^p} < \infty \quad \text{a.s.},$$

which is the same as (1.86) if $X_n$'s are independent. The proof of this
martingale SLLN and many other versions of WLLN and SLLN can be
found in standard probability textbooks, for example, Chung (1974) and
Loève (1977).

The WLLN and SLLN have many applications in probability and statis- 
tics. The following is an example. Other examples can be found in later 
chapters. 


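Example 1.32. Let $f$ and $g$ be continuous functions on $[0,1]$ satisfying
$0 \le f(x) \le Cg(x)$ for all $x$, where $C > 0$ is a constant. We now show that

$$\lim_{n\to\infty} \int_0^1\!\!\int_0^1\!\cdots\!\int_0^1 \frac{\sum_{i=1}^{n} f(x_i)}{\sum_{i=1}^{n} g(x_i)}\,dx_1\,dx_2\cdots dx_n = \frac{\int_0^1 f(x)\,dx}{\int_0^1 g(x)\,dx} \tag{1.90}$$

(assuming that $\int_0^1 g(x)\,dx \ne 0$). Let $X_1, X_2, \ldots$ be i.i.d. random variables
having the uniform distribution on $[0,1]$. By Theorem 1.2, $E[f(X_1)] =
\int_0^1 f(x)\,dx < \infty$ and $E[g(X_1)] = \int_0^1 g(x)\,dx < \infty$. By the SLLN (Theorem
1.13(ii)),

$$\frac{1}{n}\sum_{i=1}^{n} f(X_i) \to_{a.s.} E[f(X_1)],$$

and the same result holds when $f$ is replaced by $g$. By Theorem 1.10(i),

$$\frac{\sum_{i=1}^{n} f(X_i)}{\sum_{i=1}^{n} g(X_i)} \to_{a.s.} \frac{E[f(X_1)]}{E[g(X_1)]}. \tag{1.91}$$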




Since the random variable on the left-hand side of (1.91) is bounded by C, 
result (1.90) follows from the dominated convergence theorem and the fact 
that the left-hand side of (1.90) is the expectation of the random variable 
on the left-hand side of (1.91). 
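
Result (1.90) can be checked by Monte Carlo for a concrete pair of functions. The sketch below assumes $f(x) = x$ and $g(x) = 1 + x$ (so $0 \le f \le g$ on $[0,1]$ and the limit is $(1/2)/(3/2) = 1/3$); these functions, the sample sizes, and the seed are illustrative choices, not part of the example.

```python
import random

def f(x):
    return x            # assumed f, continuous on [0, 1]

def g(x):
    return 1.0 + x      # assumed g, with 0 <= f <= g on [0, 1]

def ratio_once(n, rng):
    """One draw of sum f(X_i) / sum g(X_i) with X_i i.i.d. uniform on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return sum(f(x) for x in xs) / sum(g(x) for x in xs)

rng = random.Random(3)
n, reps = 200, 2000
# Averaging over replications approximates the n-fold integral in (1.90).
estimate = sum(ratio_once(n, rng) for _ in range(reps)) / reps
target = 0.5 / 1.5
print(estimate, target)
```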


Moment inequalities introduced in §1.3.2 play important roles in proving
convergence theorems. They can also be used to obtain convergence
rates of tail probabilities of the form $P\left(|n^{-1}\sum_{i=1}^{n}(X_i - EX_i)| > t\right)$. For
example, an application of the Esseen-von Bahr, Marcinkiewicz-Zygmund,
and Chebyshev inequalities produces

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}(X_i - EX_i)\right| > t\right) = \begin{cases} O(n^{1-p}) & \text{if } 1 < p < 2 \\ O(n^{-p/2}) & \text{if } p \ge 2 \end{cases}$$

for independent random variables $X_1, \ldots, X_n$ with $\sup_n E|X_n|^p < \infty$.


1.5.5 The central limit theorem 


The WLLN and SLLN may not be useful in approximating the distributions 
of (normalized) sums of independent random variables. We need to use the 
central limit theorem (CLT), which plays a fundamental role in statistical 
asymptotic theory. 


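Theorem 1.15 (Lindeberg's CLT). Let $\{X_{nj}, j = 1, \ldots, k_n\}$ be independent
random variables with $0 < \sigma_n^2 = \mathrm{Var}(\sum_{j=1}^{k_n} X_{nj}) < \infty$, $n = 1, 2, \ldots$, and
$k_n \to \infty$ as $n \to \infty$. If

$$\sum_{j=1}^{k_n} E\left[(X_{nj} - EX_{nj})^2 I_{\{|X_{nj} - EX_{nj}| > \epsilon\sigma_n\}}\right] = o(\sigma_n^2) \quad \text{for any } \epsilon > 0, \tag{1.92}$$

then

$$\frac{1}{\sigma_n}\sum_{j=1}^{k_n} (X_{nj} - EX_{nj}) \to_d N(0,1). \tag{1.93}$$

Proof. Considering $(X_{nj} - EX_{nj})/\sigma_n$, without loss of generality we may
assume $EX_{nj} = 0$ and $\sigma_n^2 = 1$ in this proof. Let $t \in R$ be given. From the
inequality $|e^{\sqrt{-1}\,tx} - (1 + \sqrt{-1}\,tx - t^2x^2/2)| \le \min\{|tx|^2, |tx|^3\}$, the ch.f. of
$X_{nj}$ satisfies

$$\left|\phi_{X_{nj}}(t) - (1 - t^2\sigma_{nj}^2/2)\right| \le E\left(\min\{|tX_{nj}|^2, |tX_{nj}|^3\}\right), \tag{1.94}$$

where $\sigma_{nj}^2 = \mathrm{Var}(X_{nj})$. For any $\epsilon > 0$, the right-hand side of (1.94) is
bounded by $E(|tX_{nj}|^3 I_{\{|X_{nj}| < \epsilon\}}) + E(|tX_{nj}|^2 I_{\{|X_{nj}| \ge \epsilon\}})$, which is bounded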




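by $\epsilon|t|^3\sigma_{nj}^2 + t^2E(X_{nj}^2 I_{\{|X_{nj}| \ge \epsilon\}})$. Summing over $j$ and using condition
(1.92), we obtain that

$$\sum_{j=1}^{k_n} \left|\phi_{X_{nj}}(t) - (1 - t^2\sigma_{nj}^2/2)\right| \to 0. \tag{1.95}$$

By condition (1.92), $\max_{j \le k_n} \sigma_{nj}^2 \le \epsilon^2 + \max_{j \le k_n} E(X_{nj}^2 I_{\{|X_{nj}| > \epsilon\}}) \to \epsilon^2$
for arbitrary $\epsilon > 0$. Hence

$$\lim_{n\to\infty} \max_{j \le k_n} \frac{\sigma_{nj}^2}{\sigma_n^2} = 0. \tag{1.96}$$

(Note that $\sigma_n^2 = 1$ is assumed for convenience.) This implies that $1 - t^2\sigma_{nj}^2$
are all between 0 and 1 for large enough $n$. Using the inequality

$$|a_1 \cdots a_m - b_1 \cdots b_m| \le \sum_{j=1}^{m} |a_j - b_j|$$

for any complex numbers $a_j$'s and $b_j$'s with $|a_j| \le 1$ and $|b_j| \le 1$, $j =
1, \ldots, m$, we obtain that

$$\left|\prod_{j=1}^{k_n} e^{-t^2\sigma_{nj}^2/2} - \prod_{j=1}^{k_n} (1 - t^2\sigma_{nj}^2/2)\right| \le \sum_{j=1}^{k_n} \left|e^{-t^2\sigma_{nj}^2/2} - (1 - t^2\sigma_{nj}^2/2)\right|,$$

which is bounded by $t^4\sum_{j=1}^{k_n} \sigma_{nj}^4 \le t^4\max_{j \le k_n} \sigma_{nj}^2 \to 0$, since $|e^x - 1 - x| \le
x^2/2$ if $|x| \le \frac{1}{2}$ and $\sum_{j=1}^{k_n} \sigma_{nj}^2 = \sigma_n^2 = 1$. Also,

$$\left|\prod_{j=1}^{k_n} \phi_{X_{nj}}(t) - \prod_{j=1}^{k_n} (1 - t^2\sigma_{nj}^2/2)\right|$$

is bounded by the quantity on the left-hand side of (1.95) and, hence,
converges to 0 by (1.95). Thus,

$$\prod_{j=1}^{k_n} \phi_{X_{nj}}(t) = \prod_{j=1}^{k_n} e^{-t^2\sigma_{nj}^2/2} + o(1) = e^{-t^2/2} + o(1).$$

This shows that the ch.f. of $\sum_{j=1}^{k_n} X_{nj}$ converges to the ch.f. of $N(0,1)$ for
every $t$. By Theorem 1.9(ii), the result follows.

Condition (1.92) is called Lindeberg's condition. From the proof of
Theorem 1.15, Lindeberg's condition implies (1.96), which is called Feller's
condition. Feller's condition (1.96) means that all terms in the sum $\sigma_n^2 =$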




$\sum_{j=1}^{k_n} \sigma_{nj}^2$ are uniformly negligible as $n \to \infty$. If Feller's condition is assumed,
then Lindeberg's condition is not only sufficient but also necessary
for result (1.93), which is the well-known Lindeberg-Feller CLT. A proof can
be found in Billingsley (1986, pp. 373-375). Note that neither Lindeberg's
condition nor Feller's condition is necessary for result (1.93) (Exercise 158).


A sufficient condition for Lindeberg's condition is the following Liapounov's
condition, which is somewhat easier to verify:

$$\sum_{j=1}^{k_n} E|X_{nj} - EX_{nj}|^{2+\delta} = o(\sigma_n^{2+\delta}) \quad \text{for some } \delta > 0. \tag{1.97}$$


Example 1.33. Let $X_1, X_2, \ldots$ be independent random variables. Suppose
that $X_i$ has the binomial distribution $Bi(p_i, 1)$, $i = 1, 2, \ldots$, and that $\sigma_n^2 =
\sum_{i=1}^{n} \mathrm{Var}(X_i) = \sum_{i=1}^{n} p_i(1 - p_i) \to \infty$ as $n \to \infty$. For each $i$, $EX_i =
p_i$ and $E|X_i - EX_i|^3 = (1 - p_i)^3 p_i + p_i^3(1 - p_i) \le 2p_i(1 - p_i)$. Hence
$\sum_{i=1}^{n} E|X_i - EX_i|^3 \le 2\sigma_n^2$, i.e., Liapounov's condition (1.97) holds with
$\delta = 1$. Thus, by Theorem 1.15,

$$\frac{1}{\sigma_n}\sum_{i=1}^{n} (X_i - p_i) \to_d N(0,1). \tag{1.98}$$

It can be shown (exercise) that the condition $\sigma_n \to \infty$ is also necessary for
result (1.98). ∎
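
Result (1.98) can be examined without simulation, because the exact distribution of a sum of independent $Bi(p_i, 1)$ variables can be computed by repeated convolution. The sketch below assumes a particular cyclic choice of the $p_i$ in $\{0.2, 0.35, 0.5, 0.65, 0.8\}$ (so that $\sigma_n^2 \to \infty$) and reports $\sup_x |F_{W_n}(x) - \Phi(x)|$ for the standardized sum $W_n$; the helper names and the choice of $p_i$ are illustrative.

```python
import math

def sum_pmf(ps):
    """Exact p.m.f. of a sum of independent Bi(p_i, 1) variables, by convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sup_distance(n):
    """sup_x |F_{W_n}(x) - Phi(x)| for the standardized sum in (1.98)."""
    ps = [0.2 + 0.6 * ((i % 5) / 4) for i in range(n)]   # assumed p_i values
    mu = sum(ps)
    sigma = math.sqrt(sum(p * (1.0 - p) for p in ps))
    cdf, worst = 0.0, 0.0
    for k, mass in enumerate(sum_pmf(ps)):
        left = cdf                    # left limit of the c.d.f. at the jump point k
        cdf += mass
        phi = std_normal_cdf((k - mu) / sigma)
        worst = max(worst, abs(cdf - phi), abs(left - phi))
    return worst

sups = [sup_distance(n) for n in (25, 100, 400, 1600)]
print(sups)  # decreasing, roughly proportional to 1/sigma_n
```

Because the sum is lattice valued, the distance cannot decay faster than about half the largest jump of the c.d.f., which is of order $1/\sigma_n$.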


The following are useful corollaries of Theorem 1.15 (and Theorem 
1.9(iii)). Corollary 1.2 is in fact proved in Example 1.28. The proof of 
Corollary 1.3 is left as an exercise. 


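Corollary 1.2 (Multivariate CLT). Let $X_1, \ldots, X_n$ be i.i.d. random $k$-vectors
with a finite $\Sigma = \mathrm{Var}(X_1)$. Then

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} (X_i - EX_1) \to_d N_k(0, \Sigma).$$ ∎

Corollary 1.3. Let $X_{ni} \in R^{m_i}$, $i = 1, \ldots, k_n$, be independent random
vectors with $m_i \le m$ (a fixed integer), $n = 1, 2, \ldots$, $k_n \to \infty$ as $n \to \infty$, and
$\inf_{i,n} \lambda_-[\mathrm{Var}(X_{ni})] > 0$, where $\lambda_-[A]$ is the smallest eigenvalue of $A$. Let
$c_{ni} \in R^{m_i}$ be vectors such that

$$\lim_{n\to\infty} \left(\max_{1 \le i \le k_n} \|c_{ni}\|^2 \Big/ \sum_{i=1}^{k_n} \|c_{ni}\|^2\right) = 0.$$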




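(i) Suppose that $\sup_{i,n} E\|X_{ni}\|^{2+\delta} < \infty$ for some $\delta > 0$. Then

$$\sum_{i=1}^{k_n} c_{ni}^\tau (X_{ni} - EX_{ni}) \Big/ \left[\sum_{i=1}^{k_n} \mathrm{Var}(c_{ni}^\tau X_{ni})\right]^{1/2} \to_d N(0,1). \tag{1.99}$$

(ii) Suppose that whenever $m_i = m_j$, $1 \le i < j \le k_n$, $n = 1, 2, \ldots$, $X_{ni}$ and $X_{nj}$
have the same distribution with $E\|X_{ni}\|^2 < \infty$. Then (1.99) holds. ∎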


Applications of these corollaries can be found in later chapters. 

An extension of Lindeberg's CLT is the so-called martingale CLT. In
Theorem 1.15, if the independence assumption of $X_{nj}$, $j = 1, \ldots, k_n$, is
replaced by that $\{Y_n\}$ is a martingale and

$$\frac{1}{\sigma_n^2}\sum_{j=1}^{k_n} E\left[(X_{nj} - EX_{nj})^2 \mid X_{n1}, \ldots, X_{n(j-1)}\right] \to_p 1,$$

where $Y_n = \sum_{j=1}^{n} (X_{nj} - EX_{nj})$ when $n \le k_n$, $Y_n = Y_{k_n}$ when $n > k_n$, and
$X_{n0}$ is defined to be 0, then result (1.93) still holds (see, e.g., Billingsley,
1986, p. 498 and Sen and Singer 1993, p. 120).


More results on the CLT can be found, for example, in Serfling (1980) 
and Shorack and Wellner (1986). 


Let $Y_n$ be a sequence of random variables, $\{\mu_n\}$ and $\{\sigma_n\}$ be sequences
of real numbers such that $\sigma_n > 0$ for all $n$, and $(Y_n - \mu_n)/\sigma_n \to_d N(0,1)$.
Then, by Proposition 1.16,

\[ \lim_{n\to\infty} \sup_x |F_{(Y_n-\mu_n)/\sigma_n}(x) - \Phi(x)| = 0, \tag{1.100} \]

where $\Phi$ is the c.d.f. of $N(0,1)$. This implies that for any sequence of real
numbers $\{c_n\}$, $\lim_{n\to\infty} |P(Y_n \le c_n) - \Phi(\frac{c_n-\mu_n}{\sigma_n})| = 0$, i.e., $P(Y_n \le c_n)$ can
be approximated by $\Phi(\frac{c_n-\mu_n}{\sigma_n})$, regardless of whether $\{c_n\}$ has a limit. Since
$\Phi(\frac{t-\mu_n}{\sigma_n})$ is the c.d.f. of $N(\mu_n, \sigma_n^2)$, $Y_n$ is said to be asymptotically distributed
as $N(\mu_n, \sigma_n^2)$ or simply asymptotically normal. For example, $\sum_{i=1}^{k_n} c_{ni}^\tau X_{ni}$
in Corollary 1.3 is asymptotically normal. This can be extended to random
vectors. For example, $\sum_{i=1}^n X_i$ in Corollary 1.2 is asymptotically
distributed as $N_k(nEX_1, n\Sigma)$.
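
The approximation of $P(Y_n \le c_n)$ by $\Phi((c_n-\mu_n)/\sigma_n)$ can be checked numerically in a simple case. The sketch below is an illustration only: it takes $Y_n$ to have the binomial distribution $Bi(p, n)$, so that $\mu_n = np$ and $\sigma_n^2 = np(1-p)$, and computes the largest error over integer cut-offs $c$. The value $p = 0.3$, the sample sizes, and the function name are assumptions made for the example.

```python
import math
from statistics import NormalDist


def max_error(n, p):
    """Largest |P(Y_n <= c) - Phi((c - mu_n)/sigma_n)| over integers c, for Y_n ~ Bi(p, n)."""
    mu_n = n * p
    sigma_n = math.sqrt(n * p * (1.0 - p))
    std_normal = NormalDist()
    cdf, worst = 0.0, 0.0
    for c in range(n + 1):
        # accumulate the exact binomial c.d.f. at the integer c
        cdf += math.comb(n, c) * p**c * (1.0 - p) ** (n - c)
        worst = max(worst, abs(cdf - std_normal.cdf((c - mu_n) / sigma_n)))
    return worst


errors = [max_error(n, 0.3) for n in (25, 100, 400)]
print(errors)
```

Each fourfold increase in $n$ roughly halves the error, in line with the $O(n^{-1/2})$ convergence speed discussed in §1.5.6.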


1.5.6 Edgeworth and Cornish-Fisher expansions 


Let $\{Y_n\}$ be a sequence of random variables satisfying (1.100) and $W_n =
(Y_n - \mu_n)/\sigma_n$. The convergence speed of (1.100) can be used to assess
whether $\Phi$ provides a good approximation to the c.d.f. $F_{W_n}$. Also, sometimes
we would like to find an approximation to $F_{W_n}$ that is better than
$\Phi$ in terms of convergence speed. The Edgeworth expansion is a useful tool
for these purposes.

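To illustrate the idea, let $W_n = n^{-1/2} \sum_{i=1}^n (X_i - \mu)/\sigma$, where $X_1, X_2, ...$
are i.i.d. random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2$. Assume that
the m.g.f. of $Z = (X_1 - \mu)/\sigma$ is finite and positive in a neighborhood of 0.
From (1.55), the cumulant generating function of $Z$ has the expansion

\[ \kappa(t) = \sum_{j=1}^{\infty} \frac{\kappa_j}{j!} t^j, \]

where $\kappa_j$, $j = 1,2,...$, are cumulants of $Z$ (e.g., $\kappa_1 = 0$, $\kappa_2 = 1$, $\kappa_3 = EZ^3$,
and $\kappa_4 = EZ^4 - 3$), and the m.g.f. of $W_n$ is equal to

\[ \psi_n(t) = [\exp\{\kappa(t/\sqrt{n})\}]^n = \exp\Big\{ \frac{t^2}{2} + \sum_{j=3}^{\infty} \frac{\kappa_j t^j}{j!\, n^{(j-2)/2}} \Big\}, \]

where $\exp\{x\}$ denotes the exponential function $e^x$. Factoring out $e^{t^2/2}$ and
using the series expansion of the exponential function for the remaining
factor, we obtain that

\[ \psi_n(t) = e^{t^2/2} + n^{-1/2} r_1(t) e^{t^2/2} + \cdots + n^{-j/2} r_j(t) e^{t^2/2} + \cdots, \tag{1.101} \]

where $r_j$ is a polynomial of degree $3j$ depending on $\kappa_3,...,\kappa_{j+2}$ but not on
$n$, $j = 1,2,...$. For example, it can be shown (exercise) that

\[ r_1(t) = \tfrac{1}{6}\kappa_3 t^3 \quad \text{and} \quad r_2(t) = \tfrac{1}{24}\kappa_4 t^4 + \tfrac{1}{72}\kappa_3^2 t^6. \tag{1.102} \]

Since $\psi_n(t) = \int e^{tx} dF_{W_n}(x)$ and $e^{t^2/2} = \int e^{tx} d\Phi(x)$, expansion (1.101)
suggests the inverse expansion

\[ F_{W_n}(x) = \Phi(x) + n^{-1/2} R_1(x) + \cdots + n^{-j/2} R_j(x) + \cdots, \]

where $R_j(x)$ is a function satisfying $\int e^{tx} dR_j(x) = r_j(t) e^{t^2/2}$, $j = 1,2,...$.
Let $\nabla^j = \frac{d^j}{dx^j}$ be the differential operator and $\nabla = \nabla^1$. Then $R_j(x) =
r_j(-\nabla)\Phi(x)$, $j = 1,2,...$, where $r_j(-\nabla)$ is interpreted as a differential
operator. Thus, $R_j$’s can be obtained once $r_j$’s are derived. It follows from
(1.102) (exercise) that

\[ R_1(x) = -\tfrac{1}{6}\kappa_3 (x^2 - 1)\Phi'(x) \tag{1.103} \]

and

\[ R_2(x) = -\big[\tfrac{1}{24}\kappa_4 x(x^2 - 3) + \tfrac{1}{72}\kappa_3^2 x(x^4 - 10x^2 + 15)\big]\Phi'(x). \tag{1.104} \]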
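
The effect of the correction terms (1.103) and (1.104) can be seen in a case where $F_{W_n}$ is known exactly. The following Python sketch is an illustration under stated assumptions, not part of the original development: it takes the $X_i$ to be exponential with mean 1 (so $\kappa_3 = 2$ and $\kappa_4 = 6$), uses the closed-form c.d.f. of a sum of $n$ such variables, and compares $\Phi$, $\Phi + n^{-1/2}R_1$, and $\Phi + n^{-1/2}R_1 + n^{-1}R_2$ on a grid. The choice $n = 20$, the grid, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
KAPPA3, KAPPA4 = 2.0, 6.0  # cumulants of Z = X - 1 when X is exponential with mean 1


def exact_cdf(x, n):
    """Exact c.d.f. of W_n = (X_1 + ... + X_n - n) / sqrt(n) for i.i.d. exponential(1) X_i."""
    s = n + x * math.sqrt(n)  # W_n <= x  iff  X_1 + ... + X_n <= s
    if s <= 0.0:
        return 0.0
    term, partial = 1.0, 1.0  # c.d.f. of the sum: 1 - exp(-s) * sum_{k<n} s^k / k!
    for k in range(1, n):
        term *= s / k
        partial += term
    return 1.0 - math.exp(-s) * partial


def edgeworth_cdf(x, n, terms):
    """Phi(x) plus the first `terms` (0, 1 or 2) corrections from (1.103) and (1.104)."""
    pdf = STD_NORMAL.pdf(x)
    r1 = -KAPPA3 * (x**2 - 1.0) / 6.0 * pdf
    r2 = -(KAPPA4 * x * (x**2 - 3.0) / 24.0
           + KAPPA3**2 * x * (x**4 - 10.0 * x**2 + 15.0) / 72.0) * pdf
    value = STD_NORMAL.cdf(x)
    if terms >= 1:
        value += r1 / math.sqrt(n)
    if terms >= 2:
        value += r2 / n
    return value


n = 20
grid = [i / 10.0 for i in range(-20, 21)]
max_errors = [
    max(abs(exact_cdf(x, n) - edgeworth_cdf(x, n, terms)) for x in grid)
    for terms in (0, 1, 2)
]
print(max_errors)
```

The three numbers printed are the maximal errors on the grid for zero, one, and two correction terms.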


A rigorous statement of the Edgeworth expansion for a more general $W_n$
is given in the following theorem whose proof can be found in Hall (1992).


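Theorem 1.16 (Edgeworth expansions). Let $m$ be a positive integer and
$X_1, X_2, ...$ be i.i.d. random $k$-vectors having finite $m+2$ moments. Consider
$W_n = \sqrt{n}\, h(\bar{X})/\sigma_h$, where $\bar{X} = n^{-1} \sum_{i=1}^n X_i$, $h$ is a Borel function on
$\mathcal{R}^k$ that is $m+2$ times continuously differentiable in a neighborhood of
$\mu = EX_1$, $h(\mu) = 0$, and $\sigma_h^2 = [\nabla h(\mu)]^\tau \mathrm{Var}(X_1) \nabla h(\mu) > 0$. Assume that

\[ \limsup_{\|t\| \to \infty} |\phi_{X_1}(t)| < 1, \tag{1.105} \]

where $\phi_{X_1}$ is the ch.f. of $X_1$. Then, $F_{W_n}$ admits the Edgeworth expansion

\[ \sup_x \Big| F_{W_n}(x) - \Phi(x) - \sum_{j=1}^m \frac{p_j(x)\Phi'(x)}{n^{j/2}} \Big| = o\Big(\frac{1}{n^{m/2}}\Big), \tag{1.106} \]

where $p_j(x)$ is a polynomial of degree at most $3j - 1$, odd for even $j$ and
even for odd $j$, with coefficients depending on the first $m+2$ moments of
$X_1$, $j = 1,...,m$. In particular,

\[ p_1(x) = -c_1 \sigma_h^{-1} - 6^{-1} c_2 \sigma_h^{-3} (x^2 - 1) \tag{1.107} \]

with $c_1 = 2^{-1} \sum_{i=1}^k \sum_{j=1}^k a_{ij}\mu_{ij}$ and $c_2 = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k a_i a_j a_l \mu_{ijl} +
3 \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k \sum_{q=1}^k a_i a_j a_{lq} \mu_{il}\mu_{jq}$, where $a_i$ is the $i$th component of
$\nabla h(\mu)$, $a_{ij}$ is the $(i,j)$th element of the Hessian matrix $\nabla^2 h(\mu)$, $\mu_{ij} =
E(Y_iY_j)$, $\mu_{ijl} = E(Y_iY_jY_l)$, and $Y_i$ is the $i$th component of $X_1 - \mu$. (The
sign of the second term in (1.107) is the one that agrees with (1.103) when
$h(x) = x - \mu$, and with the formulas for $p_1$ in Example 1.34.) ∎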


Condition (1.105) is Cramér’s continuity condition. It is satisfied if one
component of $X_1$ has a Lebesgue p.d.f. The polynomial $p_j$ with $j \ge 2$
may be derived using the method in deriving (1.103) and (1.104), but the
derivation is usually complicated (see Hall (1992)).

Under the conditions of Theorem 1.16, the convergence speed of (1.100)
is $O(n^{-1/2})$ and, as an approximation to $F_{W_n}$, $\Phi + \sum_{j=1}^m n^{-j/2} p_j \Phi'$ is better
than $\Phi$, since its convergence speed is $o(n^{-m/2})$.


The results in Theorem 1.16 can be applied to many cases, as the fol- 
lowing example indicates. 


Example 1.34. Let $\bar{X} = n^{-1} \sum_{i=1}^n X_i$ with i.i.d. random variables $X_1, X_2, ...$
satisfying condition (1.105). First, consider the normalized random
variable $W_n = \sqrt{n}(\bar{X} - \mu)/\sigma$, where $\mu = EX_1$ and $\sigma^2 = \mathrm{Var}(X_1)$. Then,
Theorem 1.16 can be applied with $h(x) = x - \mu$ and $\sigma_h^2 = \sigma^2$, and the
Edgeworth expansion in (1.106) holds if $E|X_1|^{m+2} < \infty$. In this case,
results (1.103) and (1.104) imply that $p_j(x) = R_j(x)/\Phi'(x)$, $j = 1,2$.

Next, consider the studentized random variable $W_n = \sqrt{n}(\bar{X} - \mu)/\hat{\sigma}$,
where $\hat{\sigma}^2 = n^{-1} \sum_{i=1}^n (X_i - \bar{X})^2$. Assuming that $EX_1^{2m+4} < \infty$ and applying
Theorem 1.16 to random vectors $(X_i, X_i^2)$, $i = 1,2,...$, and $h(x,y) =
(x - \mu)/\sqrt{y - x^2}$, we obtain the Edgeworth expansion (1.106) with $\sigma_h = 1$ and

\[ p_1(x) = \tfrac{1}{6}\kappa_3 (2x^2 + 1) \]

(exercise). Furthermore, it can be found in Hall (1992, p. 73) that

\[ p_2(x) = \tfrac{1}{12}\kappa_4 x(x^2 - 3) - \tfrac{1}{18}\kappa_3^2 x(x^4 + 2x^2 - 3) - \tfrac{1}{4} x(x^2 + 3). \]

Consider now the random variable $\sqrt{n}(\hat{\sigma}^2 - \sigma^2)$. Theorem 1.16 can be
applied to random vectors $(X_i, X_i^2)$, $i = 1,2,...$, and $h(x,y) = y - x^2 - \sigma^2$.
Assume that $EX_1^{2m+4} < \infty$. It can be shown (exercise) that the Edgeworth
expansion in (1.106) holds with $W_n = \sqrt{n}(\hat{\sigma}^2 - \sigma^2)/\sigma_h$, $\sigma_h^2 = E(X_1 - \mu)^4 -
\sigma^4$, and

\[ p_1(x) = (\nu_4 - 1)^{-1/2} \big[ 1 - \tfrac{1}{6} (\nu_4 - 1)^{-1} (\nu_6 - 3\nu_4 - 6\nu_3^2 + 2)(x^2 - 1) \big], \]

where $\nu_j = \sigma^{-j} E(X_1 - \mu)^j$, $j = 3,...,6$.

Finally, consider the studentized random variable $W_n = \sqrt{n}(\hat{\sigma}^2 - \sigma^2)/\hat{\tau}$,
where $\hat{\tau}^2 = n^{-1} \sum_{i=1}^n (X_i - \bar{X})^4 - \hat{\sigma}^4$. Theorem 1.16 can be applied to
random vectors $(X_i, X_i^2, X_i^3, X_i^4)$, $i = 1,2,...$, and

\[ h(x,y,z,w) = (y - x^2 - \sigma^2)\,[w - y^2 - 4xz + 8x^2y - 4x^4]^{-1/2}. \]

Assume that $EX_1^{4m+8} < \infty$. It can be shown (exercise) that the Edgeworth
expansion in (1.106) holds with $\sigma_h^2 = 1$ and

\[ p_1(x) = -(\nu_4 - 1)^{-3/2} \big[ \tfrac{1}{2}(4\nu_3^2 + \nu_4 - \nu_6) + \tfrac{1}{3}(3\nu_3^2 + 3\nu_4 - \nu_6 - 2)(x^2 - 1) \big]. \] ∎


An inverse Edgeworth expansion is referred to as a Cornish-Fisher
expansion, which is useful in statistics (see §7.4). For $\alpha \in (0,1)$, let
$z_\alpha = \Phi^{-1}(\alpha)$. Since the c.d.f. $F_{W_n}$ may not be strictly increasing and
continuous, we define $w_{n\alpha} = \inf\{x : F_{W_n}(x) \ge \alpha\}$. The following result
can be found in Hall (1992).


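Theorem 1.17 (Cornish-Fisher expansions). Under the conditions of Theorem
1.16, $w_{n\alpha}$ admits the Cornish-Fisher expansion

\[ \sup_{\epsilon < \alpha < 1-\epsilon} \Big| w_{n\alpha} - z_\alpha - \sum_{j=1}^m \frac{q_j(z_\alpha)}{n^{j/2}} \Big| = o\Big(\frac{1}{n^{m/2}}\Big), \tag{1.108} \]

where $\epsilon$ is any constant in $(0, \frac{1}{2})$ and $q_j$’s are polynomials depending on $p_j$’s
in (1.106). ∎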


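The polynomials in (1.108) can be determined using results (1.106) and
(1.108). We illustrate it by deriving $q_1$ and $q_2$. Without loss of generality,
assume that $F_{W_n}(w_{n\alpha}) = \alpha$ (why?). Using (1.106), (1.108), Taylor’s
expansions at $z_\alpha$ for $\Phi(w_{n\alpha})$, $p_1(w_{n\alpha})\Phi'(w_{n\alpha})$, and $p_2(w_{n\alpha})\Phi'(w_{n\alpha})$, and the
fact that $\Phi''(x) = -x\Phi'(x)$, we obtain that

\begin{align*}
\alpha &= \Phi(w_{n\alpha}) + n^{-1/2} p_1(w_{n\alpha})\Phi'(w_{n\alpha}) + n^{-1} p_2(w_{n\alpha})\Phi'(w_{n\alpha}) + o(n^{-1}) \\
&= \Phi(z_\alpha) + \{ n^{-1/2} q_1(z_\alpha) + n^{-1} q_2(z_\alpha) - \tfrac{1}{2} [n^{-1/2} q_1(z_\alpha)]^2 z_\alpha \} \Phi'(z_\alpha) \\
&\quad + n^{-1/2} \{ p_1(z_\alpha) + n^{-1/2} q_1(z_\alpha) [p_1'(z_\alpha) - z_\alpha p_1(z_\alpha)] \} \Phi'(z_\alpha) \\
&\quad + n^{-1} p_2(z_\alpha)\Phi'(z_\alpha) + o(n^{-1}) \\
&= \alpha + n^{-1/2} [q_1(z_\alpha) + p_1(z_\alpha)] \Phi'(z_\alpha) + n^{-1} \{ q_2(z_\alpha) - \tfrac{1}{2} z_\alpha [q_1(z_\alpha)]^2 \\
&\quad + q_1(z_\alpha) [p_1'(z_\alpha) - z_\alpha p_1(z_\alpha)] + p_2(z_\alpha) \} \Phi'(z_\alpha) + o(n^{-1}).
\end{align*}

Ignoring terms of order $o(n^{-1})$, we conclude that

\[ q_1(x) = -p_1(x) \]

and

\[ q_2(x) = p_1(x) p_1'(x) - \tfrac{1}{2} x [p_1(x)]^2 - p_2(x). \]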
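
The formulas for $q_1$ and $q_2$ can be tried out on the normalized sample mean, for which $p_j(x) = R_j(x)/\Phi'(x)$ by Example 1.34. The Python sketch below is illustrative only and rests on assumptions that are not in the text: the $X_i$ are taken to be exponential with mean 1 (so $\kappa_3 = 2$, $\kappa_4 = 6$, and $F_{W_n}$ is continuous with a closed form), the exact quantile is found by bisection, and all names are hypothetical.

```python
import math
from statistics import NormalDist

KAPPA3, KAPPA4 = 2.0, 6.0  # cumulants of Z = X - 1 for exponential(1) X (illustrative choice)


def p1(x):
    return -KAPPA3 * (x**2 - 1.0) / 6.0


def dp1(x):
    # derivative of p1
    return -KAPPA3 * x / 3.0


def p2(x):
    return -(KAPPA4 * x * (x**2 - 3.0) / 24.0
             + KAPPA3**2 * x * (x**4 - 10.0 * x**2 + 15.0) / 72.0)


def cornish_fisher(alpha, n):
    """z_alpha + n^{-1/2} q_1(z_alpha) + n^{-1} q_2(z_alpha), with q_1 and q_2 as derived above."""
    z = NormalDist().inv_cdf(alpha)
    q1 = -p1(z)
    q2 = p1(z) * dp1(z) - 0.5 * z * p1(z) ** 2 - p2(z)
    return z + q1 / math.sqrt(n) + q2 / n


def exact_cdf(x, n):
    """Exact c.d.f. of W_n = (X_1 + ... + X_n - n) / sqrt(n) for i.i.d. exponential(1) X_i."""
    s = n + x * math.sqrt(n)
    if s <= 0.0:
        return 0.0
    term, partial = 1.0, 1.0
    for k in range(1, n):
        term *= s / k
        partial += term
    return 1.0 - math.exp(-s) * partial


def exact_quantile(alpha, n, lo=-10.0, hi=10.0):
    """Bisection for w with F_{W_n}(w) = alpha (F_{W_n} is continuous and increasing here)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if exact_cdf(mid, n) < alpha:
            lo = mid
        else:
            hi = mid
    return hi


alpha, n = 0.95, 10
z_alpha = NormalDist().inv_cdf(alpha)
w_exact = exact_quantile(alpha, n)
w_cf = cornish_fisher(alpha, n)
print(z_alpha, w_cf, w_exact)
```

In this setting the generic expression for $q_2$ reduces to $\frac{1}{24}\kappa_4 x(x^2-3) - \frac{1}{36}\kappa_3^2 x(2x^2-5)$, which gives a quick algebraic check of the derivation.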


Edgeworth and Cornish-Fisher expansions for $W_n$ in Theorem 1.16
based on non-i.i.d. $X_i$’s or for other random variables can be found in Hall
(1992), Barndorff-Nielsen and Cox (1994), and Shao and Tu (1995).


1.6 Exercises 


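1. Let $A$ and $B$ be two nonempty proper subsets of a sample space
$\Omega$, $A \ne B$ and $A \cap B \ne \emptyset$. Obtain $\sigma(\{A, B\})$, the smallest $\sigma$-field
containing $A$ and $B$.

2. Let $\mathcal{C}$ be a collection of subsets of $\Omega$ and let $\Gamma = \{\mathcal{F} : \mathcal{F}$ is a $\sigma$-field
on $\Omega$ and $\mathcal{C} \subset \mathcal{F}\}$. Show that $\Gamma \ne \emptyset$ and $\sigma(\mathcal{C}) = \cap_{\mathcal{F} \in \Gamma} \mathcal{F}$.

3. Let $(\Omega, \mathcal{F}_j)$, $j = 1,2,...$, be measurable spaces such that $\mathcal{F}_j \subset \mathcal{F}_{j+1}$,
$j = 1,2,...$. Is $\cup_j \mathcal{F}_j$ a $\sigma$-field?

4. Let $\mathcal{C}$ be the collection of intervals of the form $(a,b]$, where $-\infty <
a < b < \infty$, and let $\mathcal{D}$ be the collection of closed sets on $\mathcal{R}$. Show
that $\mathcal{B} = \sigma(\mathcal{C}) = \sigma(\mathcal{D})$, where $\mathcal{B}$ is the Borel $\sigma$-field on $\mathcal{R}$.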


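5. ($\pi$- and $\lambda$-systems). A class $\mathcal{C}$ of subsets of $\Omega$ is a $\pi$-system if and
only if $A \in \mathcal{C}$ and $B \in \mathcal{C}$ imply $A \cap B \in \mathcal{C}$. A class $\mathcal{D}$ of subsets of $\Omega$
is a $\lambda$-system if and only if (i) $\Omega \in \mathcal{D}$, (ii) $A \in \mathcal{D}$ implies $A^c \in \mathcal{D}$, and
(iii) $A_j \in \mathcal{D}$, $j = 1,2,...$, and $A_j$’s are disjoint imply that $\cup_j A_j \in \mathcal{D}$.
(a) Show that if $\mathcal{C}$ is a $\pi$-system and $\mathcal{D}$ is a $\lambda$-system, then $\mathcal{C} \subset \mathcal{D}$
implies $\sigma(\mathcal{C}) \subset \mathcal{D}$.
(b) Show that $\mathcal{D}$ is a $\lambda$-system if and only if the following conditions
hold: (i) $\Omega \in \mathcal{D}$, (ii) $A \in \mathcal{D}$, $B \in \mathcal{D}$, and $A \subset B$ imply $A^c \cap B \in \mathcal{D}$,
and (iii) $A_j \in \mathcal{D}$ and $A_j \subset A_{j+1}$, $j = 1,2,...$, imply $\cup_j A_j \in \mathcal{D}$.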


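6. Prove part (ii) and part (iii) of Proposition 1.1.

7. Let $\nu_i$, $i = 1,2,...$, be measures on $(\Omega, \mathcal{F})$ and $a_i$, $i = 1,2,...$, be
positive numbers. Show that $a_1\nu_1 + a_2\nu_2 + \cdots$ is a measure on $(\Omega, \mathcal{F})$.

8. Let $\{A_n\}$ be a sequence of events on a probability space $(\Omega, \mathcal{F}, P)$.
Define $\limsup_n A_n = \cap_{n=1}^\infty \cup_{i=n}^\infty A_i$ and $\liminf_n A_n = \cup_{n=1}^\infty \cap_{i=n}^\infty A_i$.
Show that $P(\liminf_n A_n) \le \liminf_n P(A_n)$ and $\limsup_n P(A_n) \le
P(\limsup_n A_n)$.

9. Prove Proposition 1.2.

10. Let $F(x_1,...,x_k)$ be a c.d.f. on $\mathcal{R}^k$. Show that
(a) $F(x_1,...,x_{k-1},x_k) \le F(x_1,...,x_{k-1},x_k')$ if $x_k \le x_k'$.
(b) $\lim_{x_i \to -\infty} F(x_1,...,x_k) = 0$ for any $1 \le i \le k$.
(c) $F(x_1,...,x_{k-1},\infty) = \lim_{x_k \to \infty} F(x_1,...,x_{k-1},x_k)$ is a c.d.f. on $\mathcal{R}^{k-1}$.

11. Let $(\Omega_i, \mathcal{F}_i) = (\mathcal{R}, \mathcal{B})$, $i = 1,...,k$. Show that the product $\sigma$-field
$\sigma(\mathcal{F}_1 \times \cdots \times \mathcal{F}_k)$ is the $\sigma$-field generated by all open sets in $\mathcal{R}^k$.

12. Let $\nu$ and $\lambda$ be two measures on $(\Omega, \mathcal{F})$ such that $\nu(A) = \lambda(A)$ for
any $A \in \mathcal{C}$, where $\mathcal{C} \subset \mathcal{F}$ and $\mathcal{C}$ is a $\pi$-system (i.e., if $A$ and $B$ are in
$\mathcal{C}$, then so is $A \cap B$). Assume that there are $A_i \in \mathcal{C}$, $i = 1,2,...$, such
that $\cup A_i = \Omega$ and $\nu(A_i) < \infty$ for all $i$. Show that $\nu(A) = \lambda(A)$ for
any $A \in \sigma(\mathcal{C})$. This proves the uniqueness part of Proposition 1.3.
(Hint: show that $\{A \in \sigma(\mathcal{C}) : \nu(A) = \lambda(A)\}$ is a $\sigma$-field.)

13. Let $f$ be a function from $\Omega$ to $\Lambda$. Show that
(a) $f^{-1}(B^c) = (f^{-1}(B))^c$ and $f^{-1}(\cup B_i) = \cup f^{-1}(B_i)$;
(b) $\sigma(f^{-1}(\mathcal{C})) = f^{-1}(\sigma(\mathcal{C}))$, where $\mathcal{C}$ is a collection of subsets of $\Lambda$.

14. Prove Proposition 1.4.

15. Show that a monotone function from $\mathcal{R}$ to $\mathcal{R}$ is Borel and a c.d.f. on
$\mathcal{R}^k$ is Borel.

16. Let $f$ be a function from $(\Omega, \mathcal{F})$ to $(\Lambda, \mathcal{G})$ and $A_1, A_2, ...$ be disjoint
events in $\mathcal{F}$ such that $\cup A_i = \Omega$. Let $f_n$ be a function from $(A_n, \mathcal{F}_{A_n})$
to $(\Lambda, \mathcal{G})$ such that $f_n(\omega) = f(\omega)$ for any $\omega \in A_n$, $n = 1,2,...$. Show
that $f$ is measurable from $(\Omega, \mathcal{F})$ to $(\Lambda, \mathcal{G})$ if and only if $f_n$ is
measurable from $(A_n, \mathcal{F}_{A_n})$ to $(\Lambda, \mathcal{G})$ for each $n$.

17. Let $f$ be a nonnegative Borel function on $(\Omega, \mathcal{F})$. Show that $f$ is the
limit of a sequence of simple functions $\{\varphi_n\}$ on $(\Omega, \mathcal{F})$ with $0 \le \varphi_1 \le
\varphi_2 \le \cdots \le f$.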


18. Let $\prod_{i=1}^k (\Omega_i, \mathcal{F}_i)$ be a product measurable space.
(a) Let $\pi_i$ be the $i$th projection, i.e., $\pi_i(\omega_1,...,\omega_k) = \omega_i$, $\omega_i \in \Omega_i$,
$i = 1,...,k$. Show that $\pi_1,...,\pi_k$ are measurable.
(b) Let $f$ be a function on $\prod_{i=1}^k \Omega_i$ and $g_i(\omega_i) = f(\omega_1,...,\omega_i,...,\omega_k)$,
where $\omega_j$ is a fixed point in $\Omega_j$, $j = 1,...,k$ but $j \ne i$, and $i = 1,...,k$.
Show that if $f$ is Borel on $\prod_{i=1}^k (\Omega_i, \mathcal{F}_i)$, then $g_1,...,g_k$ are Borel.
(c) In part (b), is it true that $f$ is Borel if $g_1,...,g_k$ are Borel?

19. Let $\{f_n\}$ be a sequence of Borel functions on a measurable space.
Show that
(a) $\sigma(f_1, f_2, ...) = \sigma\big(\cup_{j=1}^\infty \sigma(f_j)\big) = \sigma\big(\cup_{j=1}^\infty \sigma(f_1,...,f_j)\big)$;
(b) $\sigma(\limsup_n f_n) \subset \cap_{n=1}^\infty \sigma(f_n, f_{n+1}, ...)$.

20. (Egoroff’s theorem). Suppose that $\{f_n\}$ is a sequence of Borel functions
on a measure space $(\Omega, \mathcal{F}, \nu)$ and $f_n(\omega) \to f(\omega)$ for $\omega \in A$ with
$\nu(A) < \infty$. Show that for any $\epsilon > 0$, there is a $B \subset A$ with $\nu(B) < \epsilon$
such that $f_n(\omega) \to f(\omega)$ uniformly on $A \cap B^c$.


21. Prove (1.14) in Example 1.5.

22. Prove Proposition 1.5 and Proposition 1.6(i).

23. Let $\nu_i$, $i = 1,2$, be measures on $(\Omega, \mathcal{F})$ and $f$ be Borel. Show that

\[ \int f\, d(\nu_1 + \nu_2) = \int f\, d\nu_1 + \int f\, d\nu_2, \]

i.e., if either side of the equality is well defined, then so is the other
side, and the two sides are equal.

24. Let $f$ be an integrable Borel function on $(\Omega, \mathcal{F}, \nu)$. Show that for each
$\epsilon > 0$, there is a $\delta_\epsilon$ such that $\nu(A) < \delta_\epsilon$ and $A \in \mathcal{F}$ imply $\int_A |f| d\nu < \epsilon$.

25. Prove that part (i) and part (iii) of Theorem 1.1 are equivalent.

26. Prove Theorem 1.2.

27. Prove Theorem 1.3. (Hint: first consider simple nonnegative $f$.)

28. Consider Example 1.9. Show that (1.17) does not hold for

\[ f(i,j) = \begin{cases} 1 & i = j \\ -1 & i = j - 1 \\ 0 & \text{otherwise.} \end{cases} \]

Does this contradict Fubini’s theorem?

29. Let $f$ be a nonnegative Borel function on $(\Omega, \mathcal{F}, \nu)$ with a $\sigma$-finite
$\nu$, $A = \{(\omega, x) \in \Omega \times \mathcal{R} : 0 \le x \le f(\omega)\}$, and $m$ be the Lebesgue
measure on $(\mathcal{R}, \mathcal{B})$. Show that $A \in \sigma(\mathcal{F} \times \mathcal{B})$ and $\int_\Omega f d\nu = \nu \times m(A)$.


30. For any c.d.f. $F$ and any $a > 0$, show that $\int [F(x + a) - F(x)] dx = a$.

31. (Integration by parts). Let $F$ and $G$ be two c.d.f.’s on $\mathcal{R}$. Show that if
$F$ and $G$ have no common points of discontinuity in the interval $(a,b]$,
then $\int_{(a,b]} G(x) dF(x) = F(b)G(b) - F(a)G(a) - \int_{(a,b]} F(x) dG(x)$.

32. Let $f$ be a Borel function on $\mathcal{R}^2$ such that $f(x,y) = 0$ for each $x \in \mathcal{R}$
and $y \notin C_x$, where $m(C_x) = 0$ for each $x$ and $m$ is the Lebesgue
measure. Show that $f(x,y) = 0$ for each $y \notin C$ and $x \notin B_y$, where
$m(C) = 0$ and $m(B_y) = 0$ for each $y \notin C$.

33. Consider Example 1.11. Show that if (1.21) holds, then $P(A) =
\int_A f(x) dx$ for any Borel set $A$. (Hint: $\mathcal{A} = \{A : P(A) = \int_A f(x) dx\}$
is a $\sigma$-field containing all sets of the form $(-\infty, x]$.)

34. Prove Proposition 1.7.

35. Let $\{a_n\}$ be a sequence of positive numbers satisfying $\sum_{n=1}^\infty a_n = 1$
and let $\{P_n\}$ be a sequence of probability measures on a common
measurable space. Define $P = \sum_{n=1}^\infty a_n P_n$.
(a) Show that $P$ is a probability measure.
(b) Show that $P_n \ll \nu$ for all $n$ and a measure $\nu$ if and only if $P \ll \nu$
and, when $P \ll \nu$ and $\nu$ is $\sigma$-finite, $\frac{dP}{d\nu} = \sum_{n=1}^\infty a_n \frac{dP_n}{d\nu}$.
(c) Derive the Lebesgue p.d.f. of $P$ when $P_n$ is the gamma distribution
$\Gamma(\alpha, n^{-1})$ (Table 1.2) with $\alpha > 1$ and $a_n$ is proportional to $n^{-\alpha}$.


36. Let $F_i$ be a c.d.f. having a Lebesgue p.d.f. $f_i$, $i = 1,2$. Assume that
there is a $c \in \mathcal{R}$ such that $F_1(c) < F_2(c)$. Define

\[ F(x) = \begin{cases} F_1(x) & -\infty < x < c \\ F_2(x) & c \le x < \infty. \end{cases} \]

Show that the probability measure $P$ corresponding to $F$ satisfies
$P \ll m + \delta_c$, and find $dP/d(m + \delta_c)$, where $m + \delta_c$ is given in (1.23).

37. Let $(X,Y)$ be a random 2-vector with the following Lebesgue p.d.f.:

\[ f(x,y) = \begin{cases} 8xy & 0 \le x \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

Find the marginal p.d.f.’s of $X$ and $Y$. Are $X$ and $Y$ independent?

38. Let $(X,Y,Z)$ be a random 3-vector with the following Lebesgue p.d.f.:

\[ f(x,y,z) = \begin{cases} \dfrac{1 - \sin x \sin y \sin z}{8\pi^3} & 0 \le x \le 2\pi,\ 0 \le y \le 2\pi,\ 0 \le z \le 2\pi \\ 0 & \text{otherwise.} \end{cases} \]

Show that $X$, $Y$, and $Z$ are not independent, but are pairwise independent.


39. Prove Lemma 1.1 without using Definition 1.7 for independence.

40. Let $X$ be a random variable having a continuous c.d.f. $F$. Show that
$Y = F(X)$ has the uniform distribution $U(0,1)$ (Table 1.2).

41. Let $U$ be a random variable having the uniform distribution $U(0,1)$
and let $F$ be a c.d.f. Show that the c.d.f. of $Y = F^{-1}(U)$ is $F$, where
$F^{-1}(t) = \inf\{x \in \mathcal{R} : F(x) \ge t\}$.

42. Prove Proposition 1.8.

43. Let $X = N_k(\mu, \Sigma)$ with a positive definite $\Sigma$.
(a) Let $Y = AX + c$, where $A$ is an $l \times k$ matrix of rank $l \le k$ and
$c \in \mathcal{R}^l$. Show that $Y$ has the $N_l(A\mu + c, A\Sigma A^\tau)$ distribution.
(b) Show that the components of $X$ are independent if and only if $\Sigma$
is a diagonal matrix.
(c) Let $\Lambda$ be positive definite and $Y = N_m(\eta, \Lambda)$ be independent of
$X$. Show that $(X,Y)$ has the $N_{k+m}((\mu, \eta), D)$ distribution, where $D$
is a block diagonal matrix whose two diagonal blocks are $\Sigma$ and $\Lambda$.


44. Let $X$ be a random variable having the Lebesgue p.d.f. $\frac{2x}{\pi^2} I_{(0,\pi)}(x)$.
Derive the p.d.f. of $Y = \sin X$.

45. Let $X_i$, $i = 1,2,3$, be independent random variables having the same
Lebesgue p.d.f. $f(x) = e^{-x} I_{(0,\infty)}(x)$. Obtain the joint Lebesgue p.d.f.
of $(Y_1, Y_2, Y_3)$, where $Y_1 = X_1 + X_2 + X_3$, $Y_2 = X_1/(X_1 + X_2)$, and
$Y_3 = (X_1 + X_2)/(X_1 + X_2 + X_3)$. Are $Y_i$’s independent?

46. Let $X_1$ and $X_2$ be independent random variables having the standard
normal distribution. Obtain the joint Lebesgue p.d.f. of $(Y_1, Y_2)$,
where $Y_1 = \sqrt{X_1^2 + X_2^2}$ and $Y_2 = X_1/X_2$. Are $Y_i$’s independent?

47. Let $X_1$ and $X_2$ be independent random variables and $Y = X_1 + X_2$.
Show that $F_Y(y) = \int F_{X_2}(y - x) dF_{X_1}(x)$.

48. Show that the Lebesgue p.d.f.’s given by (1.31) and (1.33) are the
p.d.f.’s of the $\chi_n^2(\delta)$ and $F_{n_1,n_2}(\delta)$ distributions, respectively.

49. Show that the Lebesgue p.d.f. given by (1.32) is the p.d.f. of the $t_n(\delta)$
distribution.

50. Let $X = N_n(\mu, I_n)$ and $A$ be an $n \times n$ symmetric matrix. Show that
if $X^\tau A X$ has the $\chi_r^2(\delta)$ distribution, then $A^2 = A$, $r$ is the rank of $A$,
and $\delta = \mu^\tau A \mu$.

51. Let $X = N_n(\mu, I_n)$. Apply Cochran’s theorem (Theorem 1.5) to show
that if $A^2 = A$, then $X^\tau A X$ has the noncentral chi-square distribution
$\chi_r^2(\delta)$, where $A$ is an $n \times n$ symmetric matrix, $r$ is the rank of $A$, and
$\delta = \mu^\tau A \mu$.


52. Let $X_1,...,X_n$ be independent and $X_i = N(0, \sigma_i^2)$, $i = 1,...,n$. Let
$\tilde{X} = \sum_{i=1}^n \sigma_i^{-2} X_i / \sum_{i=1}^n \sigma_i^{-2}$ and $\tilde{S}^2 = \sum_{i=1}^n \sigma_i^{-2} (X_i - \tilde{X})^2$. Apply
Cochran’s theorem to show that $\tilde{X}^2$ and $\tilde{S}^2$ are independent and that
$\tilde{S}^2$ has the chi-square distribution $\chi_{n-1}^2$.

53. Let $X = N_n(\mu, I_n)$ and $A_i$ be an $n \times n$ symmetric matrix satisfying
$A_i^2 = A_i$, $i = 1,2$. Show that a necessary and sufficient condition that
$X^\tau A_1 X$ and $X^\tau A_2 X$ are independent is $A_1 A_2 = 0$.

54. Let $X$ be a random variable and $a > 0$. Show that $E|X|^a < \infty$ if and
only if $\sum_{n=1}^\infty n^{a-1} P(|X| \ge n) < \infty$.


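55. Let $X$ be a random variable. Show that
(a) if $EX$ exists, then $EX = \int_0^\infty P(X > x) dx - \int_{-\infty}^0 P(X \le x) dx$;
(b) if $X$ has range $\{0,1,2,...\}$, then $EX = \sum_{n=1}^\infty P(X \ge n)$.

56. Let $T$ be a random variable having the noncentral t-distribution $t_n(\delta)$.
Show that
(a) $E(T) = \delta\, \Gamma((n-1)/2) \sqrt{n/2} / \Gamma(n/2)$ when $n > 1$;
(b) $\mathrm{Var}(T) = \frac{n(1+\delta^2)}{n-2} - \frac{\delta^2 n}{2} \Big[ \frac{\Gamma((n-1)/2)}{\Gamma(n/2)} \Big]^2$ when $n > 2$.

57. Let $F$ be a random variable having the noncentral F-distribution
$F_{n_1,n_2}(\delta)$. Show that
(a) $E(F) = \frac{n_2(n_1+\delta)}{n_1(n_2-2)}$ when $n_2 > 2$;
(b) $\mathrm{Var}(F) = \frac{2n_2^2[(n_1+\delta)^2 + (n_2-2)(n_1+2\delta)]}{n_1^2 (n_2-2)^2 (n_2-4)}$ when $n_2 > 4$.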


58. Let $X = N_k(\mu, \Sigma)$ with a positive definite $\Sigma$.
(a) Show that $EX = \mu$ and $\mathrm{Var}(X) = \Sigma$.
(b) Let $A$ be an $l \times k$ matrix and $B$ be an $m \times k$ matrix. Show that
$AX$ and $BX$ are independent if and only if $A \Sigma B^\tau = 0$.
(c) Suppose that $k = 2$, $X = (X_1, X_2)$, $\mu = 0$, $\mathrm{Var}(X_1) = \mathrm{Var}(X_2) =
1$, and $\mathrm{Cov}(X_1, X_2) = \rho$. Show that $E(\max\{X_1, X_2\}) = \sqrt{(1-\rho)/\pi}$.

59. Let $X$ be a random variable and $g$ and $h$ be nondecreasing functions
on $\mathcal{R}$. Show that $\mathrm{Cov}(g(X), h(X)) \ge 0$ when $E|g(X)h(X)| < \infty$.

60. Let $X$ be a random variable with $EX^2 < \infty$ and let $Y = |X|$. Suppose
that $X$ has a Lebesgue p.d.f. symmetric about 0. Show that $X$ and
$Y$ are uncorrelated, but they are not independent.

61. Let $(X,Y)$ be a random 2-vector with the following Lebesgue p.d.f.:

\[ f(x,y) = \begin{cases} \pi^{-1} & x^2 + y^2 \le 1 \\ 0 & x^2 + y^2 > 1. \end{cases} \]

Show that $X$ and $Y$ are uncorrelated, but are not independent.


62. Show that inequality (1.41) holds and that when $0 < E|X|^p < \infty$ and
$0 < E|Y|^q < \infty$, the equality in (1.40) holds if and only if $\alpha|X|^p =
\beta|Y|^q$ a.s. for some nonzero constants $\alpha$ and $\beta$.

63. Prove the following inequalities.
(a) Liapounov’s inequality (1.42).
(b) Minkowski’s inequality (1.43). (Hint: apply Hölder’s inequality
to random variables $|X + Y|^{p-1}$ and $|X|$.)
(c) ($C_r$-inequality). $E|X + Y|^r \le C_r(E|X|^r + E|Y|^r)$, where $X$ and $Y$
are random variables, $r$ is a positive constant, and $C_r = 1$ if $0 < r \le 1$
and $C_r = 2^{r-1}$ if $r > 1$.

(d) Let $X_i$ be a random variable with $E|X_i|^p < \infty$, $i = 1,...,n$, where
$p$ is a constant larger than 1. Show that

\[ E\Big| \frac{1}{n} \sum_{i=1}^n X_i \Big|^p \le \min\Big\{ \frac{1}{n} \sum_{i=1}^n E|X_i|^p,\ \Big[ \frac{1}{n} \sum_{i=1}^n (E|X_i|^p)^{1/p} \Big]^p \Big\}. \]

(e) Inequality (1.44). (Hint: prove the case of $n = 2$ first and then
use induction.)
(f) Inequality (1.49).

64. Show that the following functions of $x$ are convex and discuss whether
they are strictly convex.
(a) $|x - a|^p$, where $p \ge 1$ and $a \in \mathcal{R}$.
(b) $x^{-p}$, $x \in (0,\infty)$, where $p > 0$.
(c) $e^{cx}$, where $c \in \mathcal{R}$.
(d) $x \log x$, $x \in (0,\infty)$.
(e) $g(\varphi(x))$, $x \in (a,b)$, where $-\infty \le a < b \le \infty$, $\varphi$ is convex on $(a,b)$,
and $g$ is convex and nondecreasing on the range of $\varphi$.
(f) $\varphi(x) = \sum_{i=1}^k c_i \varphi_i(x_i)$, $x = (x_1,...,x_k) \in \prod_{i=1}^k \mathcal{X}_i$, where $c_i$ is a
positive constant and $\varphi_i$ is convex on $\mathcal{X}_i$, $i = 1,...,k$.

65. Let $X = N_k(\mu, \Sigma)$ with a positive definite $\Sigma$.
(a) Show that the m.g.f. of $X$ is $e^{t^\tau \mu + t^\tau \Sigma t/2}$.
(b) Show that $EX = \mu$ and $\mathrm{Var}(X) = \Sigma$ by applying (1.54).
(c) When $k = 1$ ($\Sigma = \sigma^2$), show that $EX = \psi_X'(0) = \mu$, $EX^2 =
\psi_X''(0) = \sigma^2 + \mu^2$, $EX^3 = \psi_X^{(3)}(0) = 3\sigma^2\mu + \mu^3$, and $EX^4 = \psi_X^{(4)}(0) =
3\sigma^4 + 6\sigma^2\mu^2 + \mu^4$.
(d) In part (c), show that if $\mu = 0$, then $EX^p = 0$ when $p$ is an odd
integer and $EX^p = (p-1)(p-3) \cdots 3 \cdot 1\, \sigma^p$ when $p$ is an even integer.

66. Let $X$ be a random variable having the gamma distribution $\Gamma(\alpha, \gamma)$.
Find moments $EX^p$, $p = 1,2,...$, by differentiating the m.g.f. of $X$.

67. Let $X$ be a random variable with finite $Ee^{tX}$ and $Ee^{-tX}$ for a $t \ne 0$.
Show that $E|X|^a < \infty$ for any $a > 0$.


68. Let $X$ be a random variable having $\psi_X(t) < \infty$ for $t$ in a neighborhood
of 0. Show that the moments and cumulants of $X$ satisfy the following
equations: $\mu_1 = \kappa_1$, $\mu_2 = \kappa_2 + \kappa_1^2$, $\mu_3 = \kappa_3 + 3\kappa_1\kappa_2 + \kappa_1^3$, and
$\mu_4 = \kappa_4 + 3\kappa_2^2 + 4\kappa_1\kappa_3 + 6\kappa_1^2\kappa_2 + \kappa_1^4$, where $\mu_i$ and $\kappa_i$ are the $i$th
moment and cumulant of $X$, respectively.

69. Let $X$ be a discrete random variable taking values $0,1,2,...$. The
probability generating function of $X$ is defined to be $\rho_X(t) = E(t^X)$. Show
that
(a) $\rho_X(t) = \psi_X(\log t)$, where $\psi_X$ is the m.g.f. of $X$;
(b) $\frac{d^p \rho_X(t)}{dt^p}\big|_{t=1} = E[X(X-1) \cdots (X-p+1)]$ for any positive integer
$p$, if $\rho_X$ is finite in a neighborhood of 1.

70. Let $Y$ be a random variable having the noncentral chi-square
distribution $\chi_k^2(\delta)$. Show that
(a) the ch.f. of $Y$ is $(1 - 2\sqrt{-1}t)^{-k/2} e^{\sqrt{-1}\delta t/(1 - 2\sqrt{-1}t)}$;
(b) $E(Y) = k + \delta$ and $\mathrm{Var}(Y) = 2k + 4\delta$.

71. Let $\phi$ be a ch.f. on $\mathcal{R}^k$. Show that $|\phi| \le 1$ and $\phi$ is uniformly
continuous on $\mathcal{R}^k$.

72. For a complex number $z = a + \sqrt{-1}b$, where $a$ and $b$ are real numbers,
$\bar{z}$ is defined to be $a - \sqrt{-1}b$. Show that $\sum_{i=1}^n \sum_{j=1}^n \phi(t_i - t_j) z_i \bar{z}_j \ge 0$,
where $\phi$ is a ch.f. on $\mathcal{R}^k$, $t_1,...,t_n$ are $k$-vectors, and $z_1,...,z_n$ are
complex numbers.

73. Show that the following functions of $t \in \mathcal{R}$ are ch.f.’s, where $a > 0$
and $b > 0$ are constants:
(a) $a^2/(a^2 + t^2)$;
(b) $(1 + ab - abe^{\sqrt{-1}t})^{-1/b}$;
(c) $\max\{1 - |t|/a, 0\}$;
(d) $2(1 - \cos at)/(a^2t^2)$;
(e) $e^{-|t|^a}$, where $0 < a \le 2$;
(f) $|\phi|^2$, where $\phi$ is a ch.f. on $\mathcal{R}$;
(g) $\int \phi(ut) dG(u)$, where $\phi$ is a ch.f. on $\mathcal{R}$ and $G$ is a c.d.f. on $\mathcal{R}$.


74. Let $\phi_n$ be the ch.f. of a probability measure $P_n$, $n = 1,2,...$. Let $\{a_n\}$
be a sequence of nonnegative numbers with $\sum_{n=1}^\infty a_n = 1$. Show that
$\sum_{n=1}^\infty a_n \phi_n$ is a ch.f. and find its corresponding probability measure.

75. Let $X$ be a random variable whose ch.f. $\phi_X$ satisfies $\int |\phi_X(t)| dt < \infty$.
Show that $(2\pi)^{-1} \int e^{-\sqrt{-1}xt} \phi_X(t) dt$ is the Lebesgue p.d.f. of $X$.

76. A random variable $X$ or its distribution is of the lattice type if and
only if $F_X(x) = \sum_{j=-\infty}^\infty p_j I_{[a+jd,\infty)}(x)$, $x \in \mathcal{R}$, where $a$, $d$, $p_j$’s are
constants, $d > 0$, $p_j \ge 0$, and $\sum_{j=-\infty}^\infty p_j = 1$. Show that $X$ is of the
lattice type if and only if its ch.f. satisfies $|\phi_X(t)| = 1$ for some $t \ne 0$.


77. Let $\phi$ be a ch.f. on $\mathcal{R}$. Show that
(a) if $|\phi(t_1)| = |\phi(t_2)| = 1$ and $t_1/t_2$ is an irrational number, then
$\phi(t) = e^{\sqrt{-1}at}$ for some constant $a$;
(b) if $t_n \to 0$, $t_n \ne 0$, and $|\phi(t_n)| = 1$, then the result in (a) holds;
(c) $|\cos t|$ is not a ch.f., although $\cos t$ is a ch.f.

78. Let $X_1,...,X_k$ be independent random variables and $Y = X_1 + \cdots + X_k$.
Prove the following statements, using Theorem 1.6 and result (1.58).
(a) If $X_i$ has the binomial distribution $Bi(p, n_i)$, $i = 1,...,k$, then $Y$
has the binomial distribution $Bi(p, n_1 + \cdots + n_k)$.
(b) If $X_i$ has the Poisson distribution $P(\theta_i)$, $i = 1,...,k$, then $Y$ has
the Poisson distribution $P(\theta_1 + \cdots + \theta_k)$.
(c) If $X_i$ has the negative binomial distribution $NB(p, r_i)$, $i = 1,...,k$,
then $Y$ has the negative binomial distribution $NB(p, r_1 + \cdots + r_k)$.
(d) If $X_i$ has the exponential distribution $E(0, \theta)$, $i = 1,...,k$, then $Y$
has the gamma distribution $\Gamma(k, \theta)$.
(e) If $X_i$ has the Cauchy distribution $C(0,1)$, $i = 1,...,k$, then $Y/k$
has the same distribution as $X_1$.

79. Find an example of two random variables $X$ and $Y$ such that $X$ and
$Y$ are not independent but their ch.f.’s satisfy $\phi_X(t)\phi_Y(t) = \phi_{X+Y}(t)$
for all $t \in \mathcal{R}$.

80. Let $X_1, X_2, ...$ be independent random variables having the exponential
distribution $E(0, \theta)$. For given $t > 0$, let $Y$ be the maximum of $n$
such that $T_n \le t$, where $T_0 = 0$ and $T_n = X_1 + \cdots + X_n$, $n = 1,2,...$.
Show that $Y$ has the Poisson distribution $P(t/\theta)$.


81. Let $\Sigma$ be a $k \times k$ nonnegative definite matrix.
(a) For a nonsingular $\Sigma$, show that $X$ is $N_k(\mu, \Sigma)$ if and only if $c^\tau X$
is $N(c^\tau\mu, c^\tau\Sigma c)$ for any $c \in \mathcal{R}^k$.
(b) For a singular $\Sigma$, we define $X$ to be $N_k(\mu, \Sigma)$ if and only if $c^\tau X$ is
$N(c^\tau\mu, c^\tau\Sigma c)$ for any $c \in \mathcal{R}^k$ ($N(a, 0)$ is the c.d.f. of the point mass
at $a$). Show that the results in Exercise 43(a)-(c), Exercise 58(a)-(b),
and Exercise 65(a) still hold for $X = N_k(\mu, \Sigma)$ with a singular $\Sigma$.

82. Let $(X_1, X_2)$ be $N_k(\mu, \Sigma)$ with a $k \times k$ positive definite

\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \]

where $X_1$ is a random $l$-vector and $\Sigma_{11}$ is an $l \times l$ matrix. Show that
the conditional Lebesgue p.d.f. of $X_2$ given $X_1 = x_1$ is

\[ N_{k-l}\big( \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \big), \]

where $\mu_i = EX_i$, $i = 1,2$. (Hint: consider $X_2 - \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}(X_1 - \mu_1)$
and $X_1 - \mu_1$.)


83. Let $X$ be an integrable random variable with a Lebesgue p.d.f. $f_X$
and let $Y = g(X)$, where $g$ is a function with positive derivative on
$(0,\infty)$ and $g(x) = g(-x)$. Find an expression for $E(X|Y)$ and verify
that it is indeed the conditional expectation.

84. Prove Lemma 1.2. (Hint: first consider simple functions.)

85. Prove Proposition 1.10. (Hint for proving (ix): first show that $0 \le
X_1 \le X_2 \le \cdots$ and $X_n \to_{a.s.} X$ imply $E(X_n|\mathcal{A}) \to_{a.s.} E(X|\mathcal{A})$.)

86. Let $X$ and $Y$ be integrable random variables on $(\Omega, \mathcal{F}, P)$ and $\mathcal{A} \subset \mathcal{F}$
be a $\sigma$-field. Show that $E[Y E(X|\mathcal{A})] = E[X E(Y|\mathcal{A})]$, assuming that
both integrals exist.

87. Let $X, X_1, X_2, ...$ be a sequence of random variables on $(\Omega, \mathcal{F}, P)$ and
$\mathcal{A} \subset \mathcal{F}$ be a $\sigma$-field. Suppose that $E(X_nY) \to E(XY)$ for every
integrable (or bounded) random variable $Y$. Show that $E[E(X_n|\mathcal{A})Y] \to
E[E(X|\mathcal{A})Y]$ for every integrable (or bounded) random variable $Y$.

88. Let $X$ be a nonnegative integrable random variable on $(\Omega, \mathcal{F}, P)$ and
$\mathcal{A} \subset \mathcal{F}$ be a $\sigma$-field. Show that $E(X|\mathcal{A}) = \int_0^\infty P(X > t|\mathcal{A}) dt$ a.s.

89. Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{F}, P)$ and $\mathcal{A} \subset \mathcal{F}$ be a
$\sigma$-field. Prove the following inequalities for conditional expectations.
(a) If $E|X|^p < \infty$ and $E|Y|^q < \infty$ for constants $p$ and $q$ with $p > 1$ and
$p^{-1} + q^{-1} = 1$, then $E(|XY| \,|\, \mathcal{A}) \le [E(|X|^p|\mathcal{A})]^{1/p} [E(|Y|^q|\mathcal{A})]^{1/q}$ a.s.
(b) If $E|X|^p < \infty$ and $E|Y|^p < \infty$ for a constant $p \ge 1$, then
$[E(|X + Y|^p|\mathcal{A})]^{1/p} \le [E(|X|^p|\mathcal{A})]^{1/p} + [E(|Y|^p|\mathcal{A})]^{1/p}$ a.s.
(c) If $f$ is a convex function on $\mathcal{R}$, then $f(E(X|\mathcal{A})) \le E[f(X)|\mathcal{A}]$ a.s.


90. Let $X$ and $Y$ be random variables on a probability space with $Y =
E(X|Y)$ a.s. and let $\varphi$ be a nondecreasing convex function on $[0,\infty)$.
(a) Show that if $E\varphi(|X|) < \infty$, then $E\varphi(|Y|) < \infty$.
(b) Find an example in which $E\varphi(|Y|) < \infty$ but $E\varphi(|X|) = \infty$.
(c) Suppose that $E\varphi(|X|) = E\varphi(|Y|) < \infty$ and $\varphi$ is strictly convex
and strictly increasing. Show that $X = Y$ a.s.

91. Let $X$, $Y$, and $Z$ be random variables on a probability space. Suppose
that $E|X| < \infty$ and $Y = h(Z)$ with a Borel $h$. Show that
(a) if $X$ and $Z$ are independent and $E|Z| < \infty$, then $E(XZ|Y) =
E(X)E(Z|Y)$ a.s.;
(b) if $E[f(X)|Z] = f(Y)$ for all bounded continuous functions $f$ on
$\mathcal{R}$, then $X = Y$ a.s.;
(c) if $E[f(X)|Z] \ge f(Y)$ for all bounded, continuous, nondecreasing
functions $f$ on $\mathcal{R}$, then $X \ge Y$ a.s.


92. Prove Lemma 1.3.

93. Show that random variables $X_i$, $i = 1,...,n$, are independent according
to Definition 1.7 if and only if (1.7) holds with $F$ being the joint
c.d.f. of $X_i$’s and $F_i$ being the marginal c.d.f. of $X_i$.

94. Show that a random variable $X$ is independent of itself if and only if
$X$ is constant a.s. Can $X$ and $f(X)$ be independent for a Borel $f$?

95. Let $X$, $Y$, and $Z$ be independent random variables on a probability
space and let $U = X + Z$ and $V = Y + Z$. Show that given $Z$, $U$ and
$V$ are conditionally independent.

96. Show that the result in Proposition 1.11 may not be true if $Y_2$ is
independent of $X$ but not $(X, Y_1)$.

97. Let $X$ and $Y$ be independent random variables on a probability space.
Show that if $E|X|^a < \infty$ for some $a \ge 1$ and $E|Y| < \infty$, then
$E|X + Y|^a \ge E|X + EY|^a$.


98. Let $P_Y$ be a discrete distribution on $\{0,1,2,...\}$ and $P_{X|Y=y}$ be the
binomial distribution $Bi(p, y)$. Let $(X,Y)$ be the random vector having
the joint c.d.f. given by (1.66). Show that
(a) if $Y$ has the Poisson distribution $P(\theta)$, then the marginal
distribution of $X$ is the Poisson distribution $P(p\theta)$;
(b) if $Y + r$ has the negative binomial distribution $NB(\pi, r)$, then the
marginal distribution of $X + r$ is the negative binomial distribution
$NB(\pi/[1 - (1-p)(1-\pi)], r)$.

99. Let $X_1, X_2, ...$ be i.i.d. random variables and $Y$ be a discrete random
variable taking positive integer values. Assume that $Y$ and $X_i$’s are
independent. Let $Z = \sum_{i=1}^Y X_i$.
(a) Obtain the ch.f. of $Z$.
(b) Show that $EZ = EY\, EX_1$.
(c) Show that $\mathrm{Var}(Z) = EY\, \mathrm{Var}(X_1) + \mathrm{Var}(Y)(EX_1)^2$.

100. Let $X$, $Y$, and $Z$ be random variables having a positive joint Lebesgue
p.d.f. Let $f_{X|Y}(x|y)$ and $f_{X|Y,Z}(x|y,z)$ be respectively the conditional
p.d.f. of $X$ given $Y$ and the conditional p.d.f. of $X$ given
$(Y,Z)$, as defined by (1.61). Show that $\mathrm{Var}(1/f_{X|Y}(X|Y)\,|\,X) \le
\mathrm{Var}(1/f_{X|Y,Z}(X|Y,Z)\,|\,X)$ a.s., where $\mathrm{Var}(\xi|\zeta) = E\{[\xi - E(\xi|\zeta)]^2|\zeta\}$
for any random variables $\xi$ and $\zeta$ with $E\xi^2 < \infty$.

101. Let $\{X_n\}$ be a Markov chain. Show that if $g$ is a one-to-one Borel
function, then $\{g(X_n)\}$ is also a Markov chain. Give an example to
show that $\{g(X_n)\}$ may not be a Markov chain in general.


102. A sequence of random vectors $\{X_n\}$ is said to be a Markov chain of
order $r$ for a positive integer $r$ if $P(B|X_1,...,X_n) = P(B|X_{n-r+1},...,X_n)$
a.s. for any $B \in \sigma(X_{n+1})$ and $n = r, r+1, ...$.
(a) Let $s > r$ be two positive integers. Show that if $\{X_n\}$ is a Markov
chain of order $r$, then it is a Markov chain of order $s$.
(b) Let $\{X_n\}$ be a sequence of random variables, $r$ be a positive integer,
and $Y_n = (X_n, X_{n+1}, ..., X_{n+r-1})$. Show that $\{Y_n\}$ is a Markov
chain if and only if $\{X_n\}$ is a Markov chain of order $r$.
(c) (Autoregressive process of order $r$). Let $\{\varepsilon_n\}$ be a sequence of
independent random variables and $r$ be a positive integer. Show that
$\{X_n\}$ is a Markov chain of order $r$, where $X_n = \sum_{j=1}^r \rho_j X_{n-j} + \varepsilon_n$
and $\rho_j$’s are constants.

103. Show that if $\{X_n, \mathcal{F}_n\}$ is a martingale (or a submartingale), then
$E(X_{n+j}|\mathcal{F}_n) = X_n$ a.s. (or $E(X_{n+j}|\mathcal{F}_n) \ge X_n$ a.s.) and $EX_1 = EX_j$
(or $EX_1 \le EX_2 \le \cdots$) for any $j = 1,2,...$.

104. Show that $\{X_n\}$ in Example 1.25 is a martingale.

105. Let $\{X_j\}$ and $\{Z_j\}$ be sequences of random variables and let $f_n$ and
$g_n$ denote the Lebesgue p.d.f.’s of $Y_n = (X_1,...,X_n)$ and $(Z_1,...,Z_n)$,
respectively, $n = 1,2,...$. Define $\lambda_n = -g_n(Y_n)/f_n(Y_n) I_{\{f_n(Y_n) > 0\}}$,
$n = 1,2,...$. Show that $\{\lambda_n\}$ is a submartingale.


106. Let $\{Y_n\}$ be a sequence of independent random variables.
(a) Suppose that $EY_n = 0$ for all $n$. Let $X_1 = Y_1$ and $X_{n+1} =
X_n + Y_{n+1} h_n(X_1,...,X_n)$, $n \ge 2$, where $\{h_n\}$ is a sequence of Borel
functions. Show that $\{X_n\}$ is a martingale.
(b) Suppose that $EY_n = 0$ and $\mathrm{Var}(Y_n) = \sigma^2$ for all $n$. Let $X_n =
(\sum_{j=1}^n Y_j)^2 - n\sigma^2$. Show that $\{X_n\}$ is a martingale.
(c) Suppose that $Y_n > 0$ and $EY_n = 1$ for all $n$. Let $X_n = Y_1 \cdots Y_n$.
Show that $\{X_n\}$ is a martingale.

107. Prove the claims in the proof of Proposition 1.14.

108. Show that every sequence of integrable random variables is the sum
of a supermartingale and a submartingale.

109. Let $\{X_n\}$ be a martingale. Show that if $\{X_n\}$ is bounded either above
or below, then $\sup_n E|X_n| < \infty$.

110. Let $\{X_n\}$ be a martingale satisfying $EX_1 = 0$ and $EX_n^2 < \infty$ for all
$n$. Show that $E(X_{n+m} - X_n)^2 = \sum_{j=1}^m E(X_{n+j} - X_{n+j-1})^2$ and that
$\{X_n\}$ converges a.s.

111. Show that $\{X_n\}$ in Exercises 104, 105, and 106(c) converge a.s. to
integrable random variables.


1. Probability Theory


112. Prove Proposition 1.16.

113. In the proof of Lemma 1.4, show that
$\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\} = \bigcap_{j=1}^\infty A_j$.

114. Let $\{X_n\}$ be a sequence of independent random variables. Show that
$X_n \to_{a.s.} 0$ if and only if, for any $\epsilon > 0$,
$\sum_{n=1}^\infty P(|X_n| > \epsilon) < \infty$.

115. Let $X_1, X_2, ...$ be a sequence of identically distributed random
variables with a finite $E|X_1|$ and let $Y_n = n^{-1}\max_{i \le n} |X_i|$. Show that
(a) $Y_n \to_{L_1} 0$;
(b) $Y_n \to_{a.s.} 0$.

116. Let $X, X_1, X_2, ...$ be random variables. Find an example for each of
the following cases:
(a) $X_n \to_p X$, but $\{X_n\}$ does not converge to $X$ a.s.
(b) $X_n \to_p X$, but $\{X_n\}$ does not converge to $X$ in $L_p$ for any $p > 0$.
(c) $X_n \to_d X$, but $\{X_n\}$ does not converge to $X$ in probability (do
not use Example 1.26).
(d) $X_n \to_p X$, but $\{g(X_n)\}$ does not converge to $g(X)$ in probability
for some function $g$.

117. Let $X_1, X_2, ...$ be random variables. Show that
(a) $\{|X_n|\}$ is uniformly integrable if and only if $\sup_n E|X_n| < \infty$ and,
for any $\epsilon > 0$, there is a $\delta_\epsilon > 0$ such that
$\sup_n E(|X_n| I_A) < \epsilon$ for any event $A$ with $P(A) < \delta_\epsilon$;
(b) $\sup_n E|X_n|^{1+\delta} < \infty$ for a $\delta > 0$ implies that $\{|X_n|\}$ is
uniformly integrable.

118. Let $X, X_1, X_2, ...$ be random variables satisfying
$P(|X_n| \ge c) \le P(|X| \ge c)$ for all $n$ and $c > 0$. Show that if
$E|X| < \infty$, then $\{|X_n|\}$ is uniformly integrable.

119. Let $X_1, X_2, ...$ and $Y_1, Y_2, ...$ be random variables. Show that
(a) if $\{|X_n|\}$ and $\{|Y_n|\}$ are uniformly integrable, then $\{|X_n + Y_n|\}$ is
uniformly integrable;
(b) if $\{|X_n|\}$ is uniformly integrable, then $\{|n^{-1}\sum_{i=1}^n X_i|\}$ is
uniformly integrable.

120. Let $Y$ be an integrable random variable and $\{\mathcal{F}_n\}$ be a sequence of
$\sigma$-fields. Show that $\{|E(Y|\mathcal{F}_n)|\}$ is uniformly integrable.

121. Let $X, Y, X_1, X_2, ...$ be random variables satisfying $X_n \to_p X$ and
$P(|X_n| \le |Y|) = 1$ for all $n$. Show that if $E|Y|^r < \infty$ for some $r > 0$,
then $X_n \to_{L_r} X$.

122. Let $X_1, X_2, ...$ be a sequence of random $k$-vectors. Show that
$X_n \to_p 0$ if and only if $E[\|X_n\|/(1 + \|X_n\|)] \to 0$.


123. Let $X, X_1, X_2, ...$ be random variables. Show that $X_n \to_p X$ if and
only if, for any subsequence $\{n_k\}$ of integers, there is a further
subsequence $\{n_j\} \subset \{n_k\}$ such that $X_{n_j} \to_{a.s.} X$ as $j \to \infty$.

124. Let $X_1, X_2, ...$ be a sequence of random variables satisfying
$|X_n| \le C_1$ and $\mathrm{Var}(X_n) \ge C_2$ for all $n$, where $C_i$'s are positive
constants. Show that $X_n \to_p 0$ does not hold.

125. Prove Lemma 1.5. (Hint for part (ii): use Chebyshev's inequality
to show that $P(\sum_{n=1}^\infty I_{A_n} = \infty) = 1$, which can be shown to be
equivalent to the result in (ii).)

126. Prove part (vii) of Theorem 1.8.

127. Let $X, X_1, X_2, ..., Y_1, Y_2, ..., Z_1, Z_2, ...$ be random variables. Prove
the following statements.
(a) If $X_n \to_d X$, then $X_n = O_p(1)$.
(b) If $X_n = O_p(Z_n)$ and $P(Y_n = 0) = 0$, then $X_nY_n = O_p(Y_nZ_n)$.
(c) If $X_n = O_p(Z_n)$ and $Y_n = O_p(Z_n)$, then $X_n + Y_n = O_p(Z_n)$.
(d) If $E|X_n| = O(a_n)$, then $X_n = O_p(a_n)$, where $a_n \in (0,\infty)$.
(e) If $X_n \to_{a.s.} X$, then $\sup_n |X_n| = O_p(1)$.

128. Let $\{X_n\}$ and $\{Y_n\}$ be two sequences of random variables such that
$X_n = O_p(1)$ and
$P(X_n \le t, Y_n \ge t+\epsilon) + P(X_n \ge t+\epsilon, Y_n \le t) = o(1)$
for any fixed $t \in \mathcal{R}$ and $\epsilon > 0$. Show that $X_n - Y_n = o_p(1)$.

129. Let $\{F_n\}$ be a sequence of c.d.f.'s on $\mathcal{R}$, $G_n(x) = F_n(a_nx + c_n)$, and
$H_n(x) = F_n(b_nx + d_n)$, where $\{a_n\}$ and $\{b_n\}$ are sequences of positive
numbers and $\{c_n\}$ and $\{d_n\}$ are sequences of real numbers. Suppose
that $G_n \to_w G$ and $H_n \to_w H$, where $G$ and $H$ are nondegenerate
c.d.f.'s. Show that $a_n/b_n \to a > 0$, $(c_n - d_n)/a_n \to b \in \mathcal{R}$, and
$H(ax + b) = G(x)$ for all $x \in \mathcal{R}$.

130. Let $\{P_n\}$ be a sequence of probability measures on $(\mathcal{R}, \mathcal{B})$ and $f$ be a
nonnegative Borel function such that $\sup_n \int f\,dP_n < \infty$ and
$f(x) \to \infty$ as $|x| \to \infty$. Show that $\{P_n\}$ is tight.

131. Let $P, P_1, P_2, ...$ be probability measures on $(\mathcal{R}^k, \mathcal{B}^k)$. Show that if
$P_n(O) \to P(O)$ for every open subset $O$ of $\mathcal{R}^k$, then $P_n(B) \to P(B)$ for
every $B \in \mathcal{B}^k$.

132. Let $P, P_1, P_2, ...$ be probability measures on $(\mathcal{R}, \mathcal{B})$. Show that
$P_n \to_w P$ if and only if there exists a dense subset $D$ of $\mathcal{R}$ such that
$\lim_{n\to\infty} P_n((a,b]) = P((a,b])$ for any $a < b$, $a \in D$ and $b \in D$.

133. Let $F_n$, $n = 0,1,2,...$, be c.d.f.'s such that $F_n \to_w F_0$. Let
$G_n(U) = \sup\{x : F_n(x) \le U\}$, $n = 0,1,2,...$, where $U$ is a random variable
having the uniform $U(0,1)$ distribution. Show that $G_n(U) \to_p G_0(U)$.


134. Let $P, P_1, P_2, ...$ be probability measures on $(\mathcal{R}, \mathcal{B})$. Suppose that
$P_n \to_w P$ and $\{g_n\}$ is a sequence of bounded continuous functions on
$\mathcal{R}$ converging uniformly to $g$. Show that $\int g_n\,dP_n \to \int g\,dP$.

135. Let $X, X_1, X_2, ...$ be random $k$-vectors and $Y, Y_1, Y_2, ...$ be random
$l$-vectors. Suppose that $X_n \to_d X$, $Y_n \to_d Y$, and $X_n$ and $Y_n$ are
independent for each $n$. Show that $(X_n, Y_n)$ converges in distribution
to a random $(k+l)$-vector.

136. Let $X_1, X_2, ...$ be independent random variables with
$P(X_n = \pm 2^{-n}) = \frac{1}{2}$, $n = 1,2,....$ Show that
$\sum_{i=1}^n X_i \to_d U$, where $U$ has the uniform distribution $U(-1,1)$.

137. Let $\{X_n\}$ and $\{Y_n\}$ be two sequences of random variables. Suppose
that $X_n \to_d X$ and that $P_{Y_n|X_n = x_n} \to_w P_Y$ almost surely for every
sequence of numbers $\{x_n\}$, where $X$ and $Y$ are independent random
variables. Show that $X_n + Y_n \to_d X + Y$.

138. Let $X_1, X_2, ...$ be i.i.d. random variables having the ch.f. of the form
$1 - c|t|^a + o(|t|^a)$ as $t \to 0$, where $0 < a \le 2$. Determine the constants
$b$ and $u$ so that $\sum_{i=1}^n X_i/(bn^u)$ converges in distribution to a random
variable having ch.f. $e^{-|t|^a}$.

139. Let $X, X_1, X_2, ...$ be random $k$-vectors and $A_1, A_2, ...$ be events.
Suppose that $X_n \to_d X$. Show that $X_nI_{A_n} \to_d X$ if and only if
$P(A_n) \to 1$.

140. Let $X_n$ be a random variable having the $N(\mu_n, \sigma_n^2)$ distribution,
$n = 1,2,...$, and $X$ be a random variable having the $N(\mu, \sigma^2)$ distribution.
Show that $X_n \to_d X$ if and only if $\mu_n \to \mu$ and $\sigma_n \to \sigma$.

141. Suppose that $X_n$ is a random variable having the binomial distribution
$Bi(p_n, n)$. Show that if $np_n \to \theta > 0$, then $X_n \to_d X$, where $X$
has the Poisson distribution $P(\theta)$.

142. Let $f_n$ be the Lebesgue p.d.f. of the t-distribution $t_n$, $n = 1,2,....$
Show that $f_n(x) \to f(x)$ for any $x \in \mathcal{R}$, where $f$ is the Lebesgue
p.d.f. of the standard normal distribution.

143. Prove Theorem 1.10.

144. Show by example that $X_n \to_d X$ and $Y_n \to_d Y$ does not necessarily
imply that $g(X_n, Y_n) \to_d g(X, Y)$, where $g$ is a continuous function.

145. Prove Theorem 1.11(ii)-(iii) and Theorem 1.12(ii). Extend Theorem
1.12(i) to the case where $g$ is a function from $\mathcal{R}^p$ to $\mathcal{R}^q$ with
$2 \le q \le p$.

146. Let $U_1, U_2, ...$ be i.i.d. random variables having the uniform
distribution on $[0,1]$ and $Y_n = (\prod_{i=1}^n U_i)^{-1/n}$. Show that
$\sqrt{n}(Y_n - e) \to_d N(0, e^2)$.


147. Prove Lemma 1.6. (Hint:
$a_n^{-1}\sum_{i=1}^n x_i = b_n - a_n^{-1}\sum_{i=1}^{n-1} b_i(a_{i+1} - a_i)$,
where $b_n = \sum_{i=1}^n x_i/a_i$.)

148. In Theorem 1.13,
(a) prove (1.82) for bounded $c_i$'s when $E|X_1| < \infty$;
(b) show that if $EX_1 = \infty$, then $n^{-1}\sum_{i=1}^n X_i \to_{a.s.} \infty$;
(c) show that if $E|X_1| = \infty$, then
$P(\limsup_n\{|\sum_{i=1}^n X_i| > cn\}) = P(\limsup_n\{|X_n| > cn\}) = 1$
for any fixed positive constant $c$, and
$\limsup_n |n^{-1}\sum_{i=1}^n X_i| = \infty$ a.s.

149. Let $X_1, ..., X_n$ be i.i.d. random variables such that for $x = 3,4,...$,
$P(X_1 = \pm x) = (2cx^2\log x)^{-1}$, where $c = \sum_{x=3}^\infty x^{-2}/\log x$. Show
that $E|X_1| = \infty$ but $n^{-1}\sum_{i=1}^n X_i \to_p 0$, using Theorem 1.13(i).

150. Let $X_1, X_2, ...$ be i.i.d. random variables satisfying $P(X_1 = 2^j) = 2^{-j}$,
$j = 1,2,....$ Show that the WLLN does not hold for $\{X_n\}$, i.e., (1.80)
does not hold for any $\{a_n\}$.

151. Let $X_1, X_2, ...$ be independent random variables. Suppose that, as
$n \to \infty$, $\sum_{i=1}^n P(|X_i| > n) \to 0$ and
$n^{-2}\sum_{i=1}^n E(X_i^2 I_{\{|X_i| \le n\}}) \to 0$.
Show that $(T_n - b_n)/n \to_p 0$, where $T_n = \sum_{i=1}^n X_i$ and
$b_n = \sum_{i=1}^n E(X_i I_{\{|X_i| \le n\}})$.

152. Let $T_n = \sum_{i=1}^n X_i$, where $X_n$'s are independent random variables
satisfying $P(X_n = \pm n^\theta) = 0.5$ and $\theta > 0$ is a constant. Show that
(a) when $\theta < 0.5$, $T_n/n \to_{a.s.} 0$;
(b) when $\theta \ge 1$, $T_n/n \to_p 0$ does not hold.

153. Let $X_2, X_3, ...$ be a sequence of independent random variables
satisfying $P(X_n = \pm\sqrt{n/\log n}) = 0.5$. Show that (1.86) does not hold for
$p \in [1,2]$ but (1.88) is satisfied for $p = 2$ and, thus, (1.89) holds.

154. Let $X_1, ..., X_n$ be i.i.d. random variables with $\mathrm{Var}(X_1) < \infty$. Show
that $[n(n+1)]^{-1}\sum_{j=1}^n jX_j \to_p EX_1$.

155. Let $\{X_n\}$ be a sequence of random variables and let
$\bar{X} = \sum_{i=1}^n X_i/n$.
(a) Show that if $X_n \to_{a.s.} 0$, then $\bar{X} \to_{a.s.} 0$.
(b) Show that if $X_n \to_{L_r} 0$, then $\bar{X} \to_{L_r} 0$, where $r \ge 1$ is a constant.
(c) Show that the result in (b) may not be true for $r \in (0,1)$.
(d) Show that $X_n \to_p 0$ may not imply $\bar{X} \to_p 0$.

156. Let $X_1, ..., X_n$ be random variables and $\{\mu_n\}$, $\{\sigma_n\}$, $\{a_n\}$, and $\{b_n\}$
be sequences of real numbers with $\sigma_n \ge 0$ and $a_n \ge 0$. Suppose that
$X_n$ is asymptotically distributed as $N(\mu_n, \sigma_n^2)$. Show that $a_nX_n + b_n$
is asymptotically distributed as $N(\mu_n, \sigma_n^2)$ if and only if $a_n \to 1$ and
$[\mu_n(a_n - 1) + b_n]/\sigma_n \to 0$.


157. Show that Liapounov's condition (1.97) implies Lindeberg's condition
(1.92).

158. Let $X_1, X_2, ...$ be a sequence of independent random variables and
$\sigma_n^2 = \mathrm{Var}(\sum_{j=1}^n X_j)$.
(a) Show that if $X_n = N(0, 2^{-n})$, $n = 1,2,...$, then Feller's condition
(1.96) does not hold but $\sum_{j=1}^n (X_j - EX_j)/\sigma_n \to_d N(0,1)$.
(b) Show that the result in (a) is still true if $X_1$ has the uniform
distribution $U(-1,1)$ and $X_n = N(0, 2^{n-1})$, $n = 2,3,....$

159. In Example 1.33, show that
(a) the condition $\sigma_n^2 \to \infty$ is also necessary for (1.98);
(b) $n^{-1}\sum_{j=1}^n (X_j - p_j) \to_{L_r} 0$ for any constant $r > 0$;
(c) $n^{-1}\sum_{j=1}^n (X_j - p_j) \to_{a.s.} 0$.

160. Prove Corollary 1.3.

161. Suppose that $X_n$ is a random variable having the binomial distribution
$Bi(\theta, n)$, where $0 < \theta < 1$, $n = 1,2,....$ Define $Y_n = \log(X_n/n)$
when $X_n \ge 1$ and $Y_n = 1$ when $X_n = 0$. Show that $Y_n \to_{a.s.} \log\theta$
and $\sqrt{n}(Y_n - \log\theta) \to_d N(0, \frac{1-\theta}{\theta})$. Establish similar
results when $X_n$ has the Poisson distribution $P(n\theta)$.

162. Let $X_1, X_2, ...$ be independent random variables such that $X_j$ has the
uniform distribution on $[-j, j]$, $j = 1,2,....$ Show that Lindeberg's
condition is satisfied and state the resulting CLT.

163. Let $X_1, X_2, ...$ be independent random variables such that for
$j = 1,2,...$, $P(X_j = \pm j^a) = 6^{-1}j^{-2(a-1)}$ and
$P(X_j = 0) = 1 - 3^{-1}j^{-2(a-1)}$, where $a > 1$ is a constant. Show that
Lindeberg's condition is satisfied if and only if $a < 1.5$.

164. Let $X_1, X_2, ...$ be independent random variables with
$P(X_j = \pm j^a) = P(X_j = 0) = 1/3$, where $a > 0$, $j = 1,2,....$ Can we apply
Theorem 1.15 to $\{X_j\}$ by checking Liapounov's condition (1.97)?

165. Let $\{X_n\}$ be a sequence of independent random variables. Suppose
that $\sum_{j=1}^n (X_j - EX_j)/\sigma_n \to_d N(0,1)$, where
$\sigma_n^2 = \mathrm{Var}(\sum_{j=1}^n X_j)$. Show that
$n^{-1}\sum_{j=1}^n (X_j - EX_j) \to_p 0$ if and only if $\sigma_n = o(n)$.

166. Consider Exercise 152. Show that $T_n/\sqrt{\mathrm{Var}(T_n)} \to_d N(0,1)$ and,
when $0.5 \le \theta < 1$, $T_n/n \to_p 0$ does not hold.

167. Prove (1.102)-(1.104).

168. In Example 1.34, prove $\sigma_h^2 = 1$ for $\sqrt{n}(\bar{X} - \mu)/\hat{\sigma}$ and
$\sqrt{n}(\hat{\sigma}^2 - \sigma^2)/\hat{\tau}$ and derive the expressions for $p_1(x)$
in all four cases.


Chapter 2 


Fundamentals of Statistics 


This chapter discusses some fundamental concepts of mathematical statis- 
tics. These concepts are essential for the material in later chapters. 


2.1 Populations, Samples, and Models 


A typical statistical problem can be described as follows. One or a series of 
random experiments is performed; some data from the experiment(s) are 
collected; and our task is to extract information from the data, interpret 
the results, and draw some conclusions. In this book we do not consider 
the problem of planning experiments and collecting data, but concentrate 
on statistical analysis of the data, assuming that the data are given. 


A descriptive data analysis can be performed to obtain some summary 
measures of the data, such as the mean, median, range, standard devia- 
tion, etc., and some graphical displays, such as the histogram and box- 
and-whisker diagram, etc. (see, e.g., Hogg and Tanis (1993)). Although 
this kind of analysis is simple and requires almost no assumptions, it may 
not allow us to gain enough insight into the problem. We focus on more 
sophisticated methods of analyzing data: statistical inference and decision 
theory. 


2.1.1 Populations and samples 
In statistical inference and decision theory, the data set is viewed as a
realization or observation of a random element defined on a probability space
$(\Omega, \mathcal{F}, P)$ related to the random experiment. The probability
measure $P$ is called the population. The data set or the random element that
produces the data is called a sample from $P$. The size of the data set is
called the sample size. A population $P$ is known if and only if $P(A)$ is a
known value for every event $A \in \mathcal{F}$. In a statistical problem, the
population $P$ is at least partially unknown and we would like to deduce some
properties of $P$ based on the available sample.


Example 2.1 (Measurement problems). To measure an unknown quantity
$\theta$ (for example, a distance, weight, or temperature), $n$ measurements,
$x_1, ..., x_n$, are taken in an experiment of measuring $\theta$. If $\theta$ can be
measured without errors, then $x_i = \theta$ for all $i$; otherwise, each $x_i$ has a
possible measurement error. In descriptive data analysis, a few summary
measures may be calculated, for example, the sample mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

and the sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.$$

However, what is the relationship between $\bar{x}$ and $\theta$? Are they close (if
not equal) in some sense? The sample variance $s^2$ is clearly an average of
squared deviations of $x_i$'s from their mean. But, what kind of information
does $s^2$ provide? Finally, is it enough to just look at $\bar{x}$ and $s^2$ for the
purpose of measuring $\theta$? These questions cannot be answered in descriptive
data analysis.

In statistical inference and decision theory, the data set, $(x_1, ..., x_n)$, is
viewed as an outcome of the experiment whose sample space is $\Omega = \mathcal{R}^n$.
We usually assume that the $n$ measurements are obtained in $n$ independent
trials of the experiment. Hence, we can define a random $n$-vector
$X = (X_1, ..., X_n)$ on $\prod_{i=1}^n (\mathcal{R}, \mathcal{B}, P)$ whose realization is
$(x_1, ..., x_n)$. The population in this problem is $P$ (note that the product
probability measure is determined by $P$) and is at least partially unknown.
The random vector $X$ is a sample and $n$ is the sample size. Define

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \qquad (2.1)$$

and

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2. \qquad (2.2)$$

Then $\bar{X}$ and $S^2$ are random variables that produce $\bar{x}$ and $s^2$, respectively.
Questions raised previously can be answered if some assumptions are
imposed on the population $P$, which are discussed later.
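As a small numerical illustration of (2.1) and (2.2), the following Python sketch computes the realized sample mean and sample variance for a made-up set of measurements. The data values and the helper name `sample_mean_and_variance` are hypothetical and are not taken from the text; the only assumption carried over is the divisor $n-1$ in the sample variance.

```python
from statistics import mean, variance

def sample_mean_and_variance(x):
    """Return (x_bar, s2) as in (2.1)-(2.2); s2 uses the divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    return x_bar, s2

# Hypothetical measurements of an unknown quantity theta (illustrative only).
measurements = [9.8, 10.1, 10.0, 9.9, 10.2]
x_bar, s2 = sample_mean_and_variance(measurements)

# Cross-check against the standard library, which uses the same n - 1 divisor.
assert abs(x_bar - mean(measurements)) < 1e-12
assert abs(s2 - variance(measurements)) < 1e-12
print(x_bar, s2)
```

The numbers produced this way are realizations only; what they say about $\theta$ depends on the assumptions placed on the population.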




When the sample $(X_1, ..., X_n)$ has i.i.d. components, which is often the
case in applications, the population is determined by the marginal
distribution of $X_i$.


Example 2.2 (Life-time testing problems). Let $x_1, ..., x_n$ be observed
lifetimes of some electronic components. Again, in statistical inference and
decision theory, $x_1, ..., x_n$ are viewed as realizations of independent random
variables $X_1, ..., X_n$. Suppose that the components are of the same type
so that it is reasonable to assume that $X_1, ..., X_n$ have a common marginal
c.d.f. $F$. Then the population is $F$, which is often unknown. A quantity of
interest in this problem is $1 - F(t)$ with a $t > 0$, which is the probability
that a component does not fail at time $t$. It is possible that all $x_i$'s are
smaller (or larger) than $t$. Conclusions about $1 - F(t)$ can be drawn based
on data $x_1, ..., x_n$ when certain assumptions on $F$ are imposed. ∎


Example 2.3 (Survey problems). A survey is often conducted when one is
not able to evaluate all elements in a collection $\mathcal{P} = \{y_1, ..., y_N\}$ containing
$N$ values in $\mathcal{R}^k$, where $k$ and $N$ are finite positive integers but $N$ may be
very large. Suppose that the quantity of interest is the population total
$Y = \sum_{i=1}^N y_i$. In a survey, a subset $s$ of $n$ elements is selected from
$\{1, ..., N\}$ and values $y_i$, $i \in s$, are obtained. Can we draw some conclusion
about $Y$ based on data $y_i$, $i \in s$?


How do we define some random variables that produce the survey data?
First, we need to specify how $s$ is selected. A commonly used probability
sampling plan can be described as follows. Assume that every element in
$\{1, ..., N\}$ can be selected at most once, i.e., we consider sampling without
replacement. Let $\mathcal{S}$ be the collection of all subsets of $n$ distinct elements
from $\{1, ..., N\}$, $\mathcal{F}_s$ be the collection of all subsets of $\mathcal{S}$, and $p$ be a
probability measure on $(\mathcal{S}, \mathcal{F}_s)$. Any $s \in \mathcal{S}$ is selected with probability
$p(s)$. Note that $p(s)$ is a known value whenever $s$ is given. Let $X_1, ..., X_n$
be random variables such that

$$P(X_1 = y_{i_1}, ..., X_n = y_{i_n}) = \frac{p(s)}{n!}, \qquad s = \{i_1, ..., i_n\} \in \mathcal{S}. \qquad (2.3)$$

Then $(y_i, i \in s)$ can be viewed as a realization of the sample $(X_1, ..., X_n)$.
If $p(s)$ is constant, then the sampling plan is called the simple random
sampling (without replacement) and $(X_1, ..., X_n)$ is called a simple random
sample. Although $X_1, ..., X_n$ are identically distributed, they are not
necessarily independent. Thus, unlike in the previous two examples, the
population in this problem may not be specified by the marginal distributions
of $X_i$'s. The population is determined by $\mathcal{P}$ and the known selection
probability measure $p$. For this reason, $\mathcal{P}$ is often treated as the population.
Conclusions about $Y$ and other characteristics of $\mathcal{P}$ can be drawn based on
data $y_i$, $i \in s$, which are discussed later.
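The claim that a simple random sample has identically distributed but dependent components can be checked by brute force on a tiny population. The sketch below is only an illustration under stated assumptions: the population values `y` and the sizes `N = 4`, `n = 2` are made up, and the joint distribution is built directly from (2.3) with a constant $p(s)$.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb, factorial

# Hypothetical finite population of N = 4 distinct values (illustrative only).
y = [1, 2, 3, 4]
N, n = len(y), 2

# Simple random sampling without replacement: p(s) is constant over subsets s.
p_s = Fraction(1, comb(N, n))

# Joint distribution of (X_1, ..., X_n) from (2.3): each ordering of s gets p(s)/n!.
joint = {}
for s in combinations(range(N), n):
    for order in permutations(s):
        values = tuple(y[i] for i in order)
        joint[values] = joint.get(values, 0) + p_s / factorial(n)

def marginal(j):
    """Marginal distribution of the (j+1)th component as a dict value -> probability."""
    out = {}
    for values, prob in joint.items():
        out[values[j]] = out.get(values[j], 0) + prob
    return out

m1, m2 = marginal(0), marginal(1)
identically_distributed = (m1 == m2)

# Independence would need P(X_1 = a, X_2 = b) = P(X_1 = a) P(X_2 = b) for all a, b,
# which fails for a = b because the same element cannot be drawn twice.
independent = all(joint.get((a, b), 0) == m1[a] * m2[b] for a in y for b in y)
print(identically_distributed, independent)
```

Exact rational arithmetic (`Fraction`) is used so that the equality checks are not affected by rounding.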




2.1.2 Parametric and nonparametric models 


A statistical model (a set of assumptions) on the population P in a given 
problem is often postulated to make the analysis possible or easy. Although 
testing the correctness of postulated models is part of statistical inference 
and decision theory, postulated models are often based on knowledge of the 
problem under consideration. 


Definition 2.1. A set of probability measures $P_\theta$ on $(\Omega, \mathcal{F})$ indexed by a
parameter $\theta \in \Theta$ is said to be a parametric family if and only if
$\Theta \subset \mathcal{R}^d$ for some fixed positive integer $d$ and each $P_\theta$ is a known
probability measure when $\theta$ is known. The set $\Theta$ is called the parameter
space and $d$ is called its dimension. ∎


A parametric model refers to the assumption that the population $P$ is
in a given parametric family. A parametric family $\{P_\theta : \theta \in \Theta\}$ is said to
be identifiable if and only if $\theta_1 \ne \theta_2$ and $\theta_i \in \Theta$ imply
$P_{\theta_1} \ne P_{\theta_2}$. In most cases an identifiable parametric family can be
obtained through reparameterization. Hence, we assume in what follows that
every parametric family is identifiable unless otherwise stated.


Let $\mathcal{P}$ be a family of populations and $\nu$ be a $\sigma$-finite measure on
$(\Omega, \mathcal{F})$. If $P \ll \nu$ for all $P \in \mathcal{P}$, then $\mathcal{P}$ is said to be dominated
by $\nu$, in which case $\mathcal{P}$ can be identified by the family of densities
$\{\frac{dP}{d\nu} : P \in \mathcal{P}\}$ (or $\{\frac{dP_\theta}{d\nu} : \theta \in \Theta\}$ for a parametric family).

Many examples of parametric families can be obtained from Tables 1.1
and 1.2 in §1.3.1. All parametric families from Tables 1.1 and 1.2 are
dominated by the counting measure or the Lebesgue measure on $\mathcal{R}$.


Example 2.4 (The k-dimensional normal family). Consider the normal
distribution $N_k(\mu, \Sigma)$ given by (1.24) for a fixed positive integer $k$. An
important parametric family in statistics is the family of normal distributions

$$\mathcal{P} = \{N_k(\mu, \Sigma) : \mu \in \mathcal{R}^k, \ \Sigma \in \mathcal{M}_k\},$$

where $\mathcal{M}_k$ is a collection of $k \times k$ symmetric positive definite matrices. This
family is dominated by the Lebesgue measure on $\mathcal{R}^k$.


In the measurement problem described in Example 2.1, $X_i$'s are often
i.i.d. from the $N(\mu, \sigma^2)$ distribution. Hence, we can impose a parametric
model on the population, i.e.,
$P \in \mathcal{P} = \{N(\mu, \sigma^2) : \mu \in \mathcal{R}, \ \sigma^2 > 0\}$.

The normal parametric model is perhaps not a good model for the
life-time testing problem described in Example 2.2, since clearly $X_i \ge 0$ for
all $i$. In practice, the normal family $\{N(\mu, \sigma^2) : \mu \in \mathcal{R}, \ \sigma^2 > 0\}$ can
be used for a life-time testing problem if one puts some restrictions on $\mu$
and $\sigma$ so that $P(X_i < 0)$ is negligible. Common parametric models for
life-time testing problems are the exponential model (containing the
exponential distributions $E(0, \theta)$ with an unknown parameter $\theta$; see Table 1.2
in §1.3.1), the gamma model (containing the gamma distributions $\Gamma(\alpha, \gamma)$
with unknown parameters $\alpha$ and $\gamma$), the log-normal model (containing the
log-normal distributions $LN(\mu, \sigma^2)$ with unknown parameters $\mu$ and $\sigma$), the
Weibull model (containing the Weibull distributions $W(\alpha, \theta)$ with unknown
parameters $\alpha$ and $\theta$), and any subfamilies of these parametric families (e.g.,
a family containing the gamma distributions with one known parameter and
one unknown parameter).

The normal family is often not a good choice for the survey problem
discussed in Example 2.3. ∎


In a given problem, a parametric model is not useful if the dimension
of $\Theta$ is very high. For example, the survey problem described in Example
2.3 has a natural parametric model, since the population $\mathcal{P}$ can be indexed
by the parameter $\theta = (y_1, ..., y_N)$. If there is no restriction on the $y$-values,
however, the dimension of the parameter space is $kN$, which is usually much
larger than the sample size $n$. If there are some restrictions on the $y$-values
(for example, $y_i$'s are nonnegative integers no larger than a fixed integer
$m$), then the dimension of the parameter space is at most $m + 1$ and the
parametric model becomes useful.


A family of probability measures is said to be nonparametric if it is not
parametric according to Definition 2.1. A nonparametric model refers to the
assumption that the population $P$ is in a given nonparametric family. There
may be almost no assumption on a nonparametric family, for example, the
family of all probability measures on $(\mathcal{R}^k, \mathcal{B}^k)$. But in many applications,
we may use one or a combination of the following assumptions to form a
nonparametric family on $(\mathcal{R}^k, \mathcal{B}^k)$:

(1) The joint c.d.f.'s are continuous.

(2) The joint c.d.f.'s have finite moments of order $\le$ a fixed integer.

(3) The joint c.d.f.'s have p.d.f.'s (e.g., Lebesgue p.d.f.'s).

(4) $k = 1$ and the c.d.f.'s are symmetric.


For instance, in Example 2.1, we may assume a nonparametric model 
with symmetric and continuous c.d.f.’s. The symmetry assumption may 
not be suitable for the population in Example 2.2, but the continuity as- 
sumption seems to be reasonable. 


In statistical inference and decision theory, methods designed for
parametric models are called parametric methods, whereas methods designed
for nonparametric models are called nonparametric methods. However,
nonparametric methods are used in a parametric model when parametric
methods are not effective, such as when the dimension of the parameter
space is too high (Example 2.3). On the other hand, parametric methods
may be applied to a semi-parametric model, which is a nonparametric model
having a parametric component. Some examples are provided in §5.1.4.


2.1.3 Exponential and location-scale families 


In this section, we discuss two types of parametric families that are of 
special importance in statistical inference and decision theory. 


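Definition 2.2 (Exponential families). A parametric family $\{P_\theta : \theta \in \Theta\}$
dominated by a $\sigma$-finite measure $\nu$ on $(\Omega, \mathcal{F})$ is called an exponential family
if and only if

$$\frac{dP_\theta}{d\nu}(\omega) = \exp\{[\eta(\theta)]^\tau T(\omega) - \xi(\theta)\}\, h(\omega), \qquad \omega \in \Omega, \qquad (2.4)$$

where $\exp\{x\} = e^x$, $T$ is a random $p$-vector with a fixed positive integer $p$,
$\eta$ is a function from $\Theta$ to $\mathcal{R}^p$, $h$ is a nonnegative Borel function on
$(\Omega, \mathcal{F})$, and
$\xi(\theta) = \log\{\int_\Omega \exp\{[\eta(\theta)]^\tau T(\omega)\} h(\omega)\, d\nu(\omega)\}$. ∎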


In Definition 2.2, $T$ and $h$ are functions of $\omega$ only, whereas $\eta$ and $\xi$
are functions of $\theta$ only. $\Omega$ is usually $\mathcal{R}^k$. The representation (2.4) of an
exponential family is not unique. In fact, any transformation
$\tilde{\eta}(\theta) = D\eta(\theta)$ with a $p \times p$ nonsingular matrix $D$ gives another
representation (with $T$ replaced by $\tilde{T} = (D^\tau)^{-1}T$). A change of the measure
that dominates the family also changes the representation. For example, if we
define $\lambda(A) = \int_A h\, d\nu$ for any $A \in \mathcal{F}$, then we obtain an exponential
family with densities

$$\frac{dP_\theta}{d\lambda}(\omega) = \exp\{[\eta(\theta)]^\tau T(\omega) - \xi(\theta)\}. \qquad (2.5)$$

In an exponential family, consider the reparameterization $\eta = \eta(\theta)$ and

$$f_\eta(\omega) = \exp\{\eta^\tau T(\omega) - \zeta(\eta)\}\, h(\omega), \qquad \omega \in \Omega, \qquad (2.6)$$

where $\zeta(\eta) = \log\{\int_\Omega \exp\{\eta^\tau T(\omega)\} h(\omega)\, d\nu(\omega)\}$. This is the
canonical form for the family, which is not unique for the reasons discussed
previously. The new parameter $\eta$ is called the natural parameter. The new
parameter space $\Xi = \{\eta(\theta) : \theta \in \Theta\}$, a subset of $\mathcal{R}^p$, is called the
natural parameter space. An exponential family in canonical form is called a
natural exponential family. If there is an open set contained in the natural
parameter space of an exponential family, then the family is said to be of
full rank.


Example 2.5. Let $P_\theta$ be the binomial distribution $Bi(\theta, n)$ with parameter
$\theta$, where $n$ is a fixed positive integer. Then $\{P_\theta : \theta \in (0,1)\}$ is an
exponential family, since the p.d.f. of $P_\theta$ w.r.t. the counting measure is

$$f_\theta(x) = \exp\left\{x \log\frac{\theta}{1-\theta} + n\log(1-\theta)\right\} \binom{n}{x} I_{\{0,1,...,n\}}(x)$$

($T(x) = x$, $\eta(\theta) = \log\frac{\theta}{1-\theta}$, $\xi(\theta) = -n\log(1-\theta)$, and
$h(x) = \binom{n}{x} I_{\{0,1,...,n\}}(x)$). If we let $\eta = \log\frac{\theta}{1-\theta}$, then
$\Xi = \mathcal{R}$ and the family with p.d.f.'s

$$f_\eta(x) = \exp\{x\eta - n\log(1 + e^\eta)\} \binom{n}{x} I_{\{0,1,...,n\}}(x)$$

is a natural exponential family of full rank. ∎
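The equivalence between the usual binomial p.d.f. and its natural exponential form can be confirmed numerically. The following sketch assumes illustrative values `n = 10` and `theta = 0.3` (not from the text) and uses hypothetical function names; it evaluates both forms on the whole support and reports the largest discrepancy.

```python
from math import comb, exp, log

def binomial_pmf(x, n, theta):
    """Usual form of the Bi(theta, n) p.d.f. w.r.t. counting measure."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

def natural_form_pmf(x, n, eta):
    """Natural exponential form: exp{x*eta - n*log(1 + e^eta)} * C(n, x)."""
    zeta = n * log(1 + exp(eta))
    return exp(x * eta - zeta) * comb(n, x)

n, theta = 10, 0.3                 # illustrative values, not from the text
eta = log(theta / (1 - theta))     # natural parameter eta = log(theta / (1 - theta))

max_gap = max(
    abs(binomial_pmf(x, n, theta) - natural_form_pmf(x, n, eta))
    for x in range(n + 1)
)
total = sum(natural_form_pmf(x, n, eta) for x in range(n + 1))
print(max_gap, total)
```

Any real `eta` yields a valid p.d.f. here, which is the sense in which the natural parameter space is the whole real line.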


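Example 2.6. The normal family $\{N(\mu, \sigma^2) : \mu \in \mathcal{R}, \ \sigma > 0\}$ is an
exponential family, since the Lebesgue p.d.f. of $N(\mu, \sigma^2)$ can be written as

$$\frac{1}{\sqrt{2\pi}} \exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma\right\}.$$

Hence, $T(x) = (x, -x^2)$, $\eta(\theta) = (\frac{\mu}{\sigma^2}, \frac{1}{2\sigma^2})$, $\theta = (\mu, \sigma^2)$,
$\xi(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma$, and $h(x) = 1/\sqrt{2\pi}$. Let
$\eta = (\eta_1, \eta_2) = (\frac{\mu}{\sigma^2}, \frac{1}{2\sigma^2})$. Then $\Xi = \mathcal{R} \times (0, \infty)$
and we can obtain a natural exponential family of full rank with
$\zeta(\eta) = \eta_1^2/(4\eta_2) + \log(1/\sqrt{2\eta_2})$.

A subfamily of the previous normal family, $\{N(\mu, \mu^2) : \mu \in \mathcal{R}, \ \mu \ne 0\}$,
is also an exponential family with the natural parameter
$\eta = (\frac{1}{\mu}, \frac{1}{2\mu^2})$ and natural parameter space
$\Xi = \{(x, y) : y = x^2/2, \ x \in \mathcal{R}, \ y > 0\}$. This
exponential family is not of full rank. ∎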
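A quick numerical check of this example is sketched below, under the assumption that $T(x) = (x, -x^2)$ and $\eta = (\mu/\sigma^2, 1/(2\sigma^2))$ as in the example. The parameter values, grid, and function names are illustrative choices. The second part verifies that for the subfamily with variance equal to the squared mean, the second coordinate of the natural parameter is half the square of the first, so the natural parameters lie on a curve that contains no open set of the plane.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, sigma2):
    """Lebesgue p.d.f. of N(mu, sigma2) in its usual form."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def natural_form_pdf(x, eta1, eta2):
    """exp{eta^T T(x) - zeta(eta)} h(x) with T(x) = (x, -x^2), h(x) = 1/sqrt(2*pi)."""
    zeta = eta1**2 / (4 * eta2) + log(1 / sqrt(2 * eta2))
    return exp(eta1 * x - eta2 * x**2 - zeta) / sqrt(2 * pi)

mu, sigma2 = 1.5, 4.0   # illustrative values, not from the text
eta1, eta2 = mu / sigma2, 1 / (2 * sigma2)
grid = [-3.0, -1.0, 0.0, 0.5, 2.0, 6.0]
max_gap = max(
    abs(normal_pdf(x, mu, sigma2) - natural_form_pdf(x, eta1, eta2)) for x in grid
)

def curved_eta(m):
    """Natural parameter of N(m, m^2): sigma2 = m^2 gives (1/m, 1/(2 m^2))."""
    return 1 / m, 1 / (2 * m**2)

# Every natural parameter of the subfamily satisfies y = x^2 / 2.
on_curve = all(
    abs(curved_eta(m)[1] - curved_eta(m)[0] ** 2 / 2) < 1e-12
    for m in (-2.0, 0.5, 3.0)
)
print(max_gap, on_curve)
```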


For an exponential family, (2.5) implies that there is a nonzero measure
$\lambda$ such that

$$\frac{dP_\theta}{d\lambda}(\omega) > 0 \quad \text{for all } \omega \text{ and } \theta. \qquad (2.7)$$

We can use this fact to show that a family of distributions is not an
exponential family. For example, consider the family of uniform distributions,
i.e., $P_\theta$ is $U(0, \theta)$ with an unknown $\theta \in (0, \infty)$. If
$\{P_\theta : \theta \in (0, \infty)\}$ is an exponential family, then from the previous
discussion we have a nonzero measure $\lambda$ such that (2.7) holds. For any
$t > 0$, there is a $\theta < t$ such that $P_\theta([t, \infty)) = 0$, which with (2.7)
implies that $\lambda([t, \infty)) = 0$. Also, for any $t \le 0$, $P_\theta((-\infty, t]) = 0$,
which with (2.7) implies that $\lambda((-\infty, t]) = 0$. Since $t$ is arbitrary,
$\lambda \equiv 0$. This contradiction implies that $\{P_\theta : \theta \in (0, \infty)\}$
cannot be an exponential family.

The reader may verify which of the parametric families from Tables 
1.1 and 1.2 are exponential families. As another example, we consider an 
important exponential family containing multivariate discrete distributions. 




Example 2.7 (The multinomial family). Consider an experiment having
$k + 1$ possible outcomes with $p_i$ as the probability for the $i$th outcome,
$i = 0, 1, ..., k$, $\sum_{i=0}^k p_i = 1$. In $n$ independent trials of this experiment, let
$X_i$ be the number of trials resulting in the $i$th outcome, $i = 0, 1, ..., k$. Then
the joint p.d.f. (w.r.t. counting measure) of $(X_0, X_1, ..., X_k)$ is

$$f_\theta(x_0, x_1, ..., x_k) = \frac{n!}{x_0!\,x_1! \cdots x_k!}\, p_0^{x_0} p_1^{x_1} \cdots p_k^{x_k}\, I_B(x_0, x_1, ..., x_k),$$

where $B = \{(x_0, x_1, ..., x_k) : x_i\text{'s are integers} \ge 0, \ \sum_{i=0}^k x_i = n\}$ and
$\theta = (p_0, p_1, ..., p_k)$. The distribution of $(X_0, X_1, ..., X_k)$ is called the
multinomial distribution, which is an extension of the binomial distribution.
In fact, the marginal c.d.f. of each $X_i$ is the binomial distribution
$Bi(p_i, n)$. Let
$\Theta = \{\theta \in \mathcal{R}^{k+1} : 0 < p_i < 1, \ \sum_{i=0}^k p_i = 1\}$. The parametric family
$\{f_\theta : \theta \in \Theta\}$ is called the multinomial family. Let $x = (x_0, x_1, ..., x_k)$,
$\eta = (\log p_0, \log p_1, ..., \log p_k)$, and
$h(x) = [n!/(x_0!\,x_1! \cdots x_k!)]\, I_B(x)$. Then

$$f_\theta(x_0, x_1, ..., x_k) = \exp\{\eta^\tau x\}\, h(x), \qquad x \in \mathcal{R}^{k+1}. \qquad (2.8)$$

Hence, the multinomial family is a natural exponential family with natural
parameter $\eta$. However, representation (2.8) does not provide an exponential
family of full rank, since there is no open set of $\mathcal{R}^{k+1}$ contained in the
natural parameter space. A reparameterization leads to an exponential
family with full rank. Using the fact that $\sum_{i=0}^k X_i = n$ and
$\sum_{i=0}^k p_i = 1$, we obtain that

$$f_\theta(x_0, x_1, ..., x_k) = \exp\{\eta_*^\tau x_* - \zeta(\eta_*)\}\, h(x), \qquad x \in \mathcal{R}^{k+1}, \qquad (2.9)$$

where $x_* = (x_1, ..., x_k)$, $\eta_* = (\log(p_1/p_0), ..., \log(p_k/p_0))$, and
$\zeta(\eta_*) = -n\log p_0$. The $\eta_*$-parameter space is $\mathcal{R}^k$. Hence, the family of
densities given by (2.9) is a natural exponential family of full rank. ∎
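To see that representations (2.8) and (2.9) describe the same p.d.f., one can evaluate both at a point of the support. The sketch below does this for a hypothetical case with three outcomes; the probabilities `p`, the counts `x`, and the function names are illustrative assumptions, not values from the text.

```python
from math import exp, factorial, log, prod

def h(x):
    """Multinomial coefficient n!/(x_0! ... x_k!) for counts x with n = sum(x)."""
    return factorial(sum(x)) / prod(factorial(xi) for xi in x)

def pdf_2_8(x, p):
    """Representation (2.8): exp{eta^T x} h(x) with eta_i = log p_i."""
    return exp(sum(xi * log(pj) for xi, pj in zip(x, p))) * h(x)

def pdf_2_9(x, p):
    """Representation (2.9): exp{eta_*^T x_* - zeta} h(x), zeta = -n log p_0."""
    n = sum(x)
    eta_star = [log(pj / p[0]) for pj in p[1:]]
    zeta = -n * log(p[0])
    return exp(sum(xi * e for xi, e in zip(x[1:], eta_star)) - zeta) * h(x)

p = (0.2, 0.3, 0.5)   # illustrative outcome probabilities (k + 1 = 3 outcomes)
x = (1, 2, 3)         # illustrative counts with n = 6

value_2_8 = pdf_2_8(x, p)
value_2_9 = pdf_2_9(x, p)
print(value_2_8, value_2_9)
```

The reduced form drops the redundant coordinate $x_0$, which is determined by the constraint that the counts sum to $n$.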


If $X_1, ..., X_m$ are independent random vectors with p.d.f.'s in
exponential families, then the p.d.f. of $(X_1, ..., X_m)$ is again in an
exponential family. The following result summarizes some other useful
properties of exponential families. Its proof can be found in Lehmann (1986).


Theorem 2.1. Let $\mathcal{P}$ be a natural exponential family given by (2.6).
(i) Let $T = (Y, U)$ and $\eta = (\vartheta, \varphi)$, where $Y$ and $\vartheta$ have the same
dimension. Then, $Y$ has the p.d.f.

$$f_\eta(y) = \exp\{\vartheta^\tau y - \zeta(\eta)\}$$

w.r.t. a $\sigma$-finite measure depending on $\varphi$. In particular, $T$ has a p.d.f. in a
natural exponential family. Furthermore, the conditional distribution of $Y$
given $U = u$ has the p.d.f. (w.r.t. a $\sigma$-finite measure depending on $u$)

$$f_{\vartheta,u}(y) = \exp\{\vartheta^\tau y - \zeta_u(\vartheta)\},$$

which is in a natural exponential family indexed by $\vartheta$.
(ii) If $\eta_0$ is an interior point of the natural parameter space, then the m.g.f.
$\psi_{\eta_0}$ of $P_{\eta_0} \circ T^{-1}$ is finite in a neighborhood of 0 and is given by

$$\psi_{\eta_0}(t) = \exp\{\zeta(\eta_0 + t) - \zeta(\eta_0)\}.$$

Furthermore, if $f$ is a Borel function satisfying $\int |f|\, dP_{\eta_0} < \infty$, then the
function

$$\int f(\omega) \exp\{\eta^\tau T(\omega)\}\, h(\omega)\, d\nu(\omega)$$

is infinitely often differentiable in a neighborhood of $\eta_0$, and the derivatives
may be computed by differentiation under the integral sign. ∎


Using Theorem 2.1(ii) and the result in Example 2.5, we obtain that
the m.g.f. of the binomial distribution $Bi(p, n)$ is

$$\psi_\eta(t) = \exp\{n\log(1 + e^{\eta + t}) - n\log(1 + e^\eta)\}
= \left(\frac{1 + e^\eta e^t}{1 + e^\eta}\right)^n
= (1 - p + pe^t)^n,$$

since $p = e^\eta/(1 + e^\eta)$.
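The three expressions for the binomial m.g.f. can be compared numerically. In the sketch below, the values of `n`, `p`, and `t` and all function names are illustrative assumptions; `mgf_direct` sums $e^{tx}$ against the binomial p.d.f., `mgf_from_zeta` uses the exponential-family formula with $\zeta(\eta) = n\log(1 + e^\eta)$, and `mgf_closed` is the closed form $(1 - p + pe^t)^n$.

```python
from math import comb, exp, log

def zeta(eta, n):
    """Log-normalizer of the natural binomial family: n * log(1 + e^eta)."""
    return n * log(1 + exp(eta))

def mgf_from_zeta(t, eta, n):
    """m.g.f. via the exponential-family formula exp{zeta(eta + t) - zeta(eta)}."""
    return exp(zeta(eta + t, n) - zeta(eta, n))

def mgf_direct(t, p, n):
    """m.g.f. computed from its definition as E[exp(t X)] for X ~ Bi(p, n)."""
    return sum(
        exp(t * x) * comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)
    )

def mgf_closed(t, p, n):
    """Closed form (1 - p + p e^t)^n."""
    return (1 - p + p * exp(t)) ** n

n, p, t = 8, 0.25, 0.4      # illustrative values, not from the text
eta = log(p / (1 - p))      # natural parameter, so that p = e^eta / (1 + e^eta)

values = (mgf_from_zeta(t, eta, n), mgf_direct(t, p, n), mgf_closed(t, p, n))
print(values)
```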


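Definition 2.3 (Location-scale families). Let $P$ be a known probability
measure on $(\mathcal{R}^k, \mathcal{B}^k)$, $\mathcal{V} \subset \mathcal{R}^k$, and $\mathcal{M}_k$ be a collection of $k \times k$
symmetric positive definite matrices. The family

$$\{P_{(\mu,\Sigma)} : \mu \in \mathcal{V}, \ \Sigma \in \mathcal{M}_k\} \qquad (2.10)$$

is called a location-scale family (on $\mathcal{R}^k$), where

$$P_{(\mu,\Sigma)}(B) = P(\Sigma^{-1/2}(B - \mu)), \qquad B \in \mathcal{B}^k,$$

$\Sigma^{-1/2}(B - \mu) = \{\Sigma^{-1/2}(x - \mu) : x \in B\} \subset \mathcal{R}^k$, and $\Sigma^{-1/2}$ is the
inverse of the "square root" matrix $\Sigma^{1/2}$ satisfying
$\Sigma^{1/2}\Sigma^{1/2} = \Sigma$. The parameters $\mu$ and $\Sigma^{1/2}$ are called the location
and scale parameters, respectively.


The following are some important examples of location-scale families.
The family $\{P_{(\mu,I_k)} : \mu \in \mathcal{R}^k\}$ is called a location family, where $I_k$ is
the $k \times k$ identity matrix. The family $\{P_{(0,\Sigma)} : \Sigma \in \mathcal{M}_k\}$ is called a
scale family. In some cases, we consider a location-scale family of the form
$\{P_{(\mu,\sigma^2 I_k)} : \mu \in \mathcal{R}^k, \ \sigma > 0\}$. If $X_1, ..., X_k$ are i.i.d. with a common
distribution in the location-scale family
$\{P_{(\mu,\sigma^2)} : \mu \in \mathcal{R}, \ \sigma > 0\}$, then the joint distribution of the vector
$(X_1, ..., X_k)$ is in the location-scale family
$\{P_{(\mu,\sigma^2 I_k)} : \mu \in \mathcal{V}, \ \sigma > 0\}$ with
$\mathcal{V} = \{(x, ..., x) \in \mathcal{R}^k : x \in \mathcal{R}\}$.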




A location-scale family can be generated as follows. Let $X$ be a random
$k$-vector having a distribution $P$. Then the distribution of $\Sigma^{1/2}X + \mu$ is
$P_{(\mu,\Sigma)}$. On the other hand, if $X$ is a random $k$-vector whose distribution is
in the location-scale family (2.10), then the distribution of $DX + c$ is also in
the same family, provided that $D\mu + c \in \mathcal{V}$ and $D\Sigma D^\tau \in \mathcal{M}_k$.

Let $F$ be the c.d.f. of $P$. Then the c.d.f. of $P_{(\mu,\Sigma)}$ is $F\left(\Sigma^{-1/2}(x-\mu)\right)$,
$x \in \mathcal{R}^k$. If $F$ has a Lebesgue p.d.f. $f$, then the Lebesgue p.d.f. of $P_{(\mu,\Sigma)}$ is
$\mathrm{Det}(\Sigma^{-1/2}) f\left(\Sigma^{-1/2}(x-\mu)\right)$, $x \in \mathcal{R}^k$ (Proposition 1.8).

Many families of distributions in Table 1.2 (§1.3.1) are location, scale, or
location-scale families. For example, the family of exponential distributions
$E(a,\theta)$ is a location-scale family on $\mathcal{R}$ with location parameter $a$ and scale
parameter $\theta$; the family of uniform distributions $U(0,\theta)$ is a scale family on
$\mathcal{R}$ with a scale parameter $\theta$. The $k$-dimensional normal family discussed in
Example 2.4 is a location-scale family on $\mathcal{R}^k$.
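To illustrate the c.d.f. and p.d.f. formulas in the case $k = 1$, where the scale matrix reduces to a scalar $\sigma > 0$, here is a small Python sketch. It assumes that the base measure $P$ is the standard exponential distribution $E(0,1)$, so that the location-scale member with location $a$ and scale $\theta$ is $E(a,\theta)$. All helper names are hypothetical.

```python
import math

def base_cdf(x):
    """c.d.f. F of the assumed base distribution P = E(0, 1)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def base_pdf(x):
    """Lebesgue p.d.f. f of E(0, 1)."""
    return math.exp(-x) if x > 0 else 0.0

def ls_cdf(x, mu, sigma):
    """c.d.f. of the location-scale member: F((x - mu) / sigma)."""
    return base_cdf((x - mu) / sigma)

def ls_pdf(x, mu, sigma):
    """p.d.f. of the location-scale member: f((x - mu) / sigma) / sigma."""
    return base_pdf((x - mu) / sigma) / sigma

def exponential_cdf(x, a, theta):
    """c.d.f. of E(a, theta), written out directly for comparison."""
    return 1.0 - math.exp(-(x - a) / theta) if x > a else 0.0
```

Here `ls_cdf` should coincide with `exponential_cdf`, and a numerical derivative of `ls_cdf` should be close to `ls_pdf`, which is the one-dimensional version of the factor $\mathrm{Det}(\Sigma^{-1/2})$.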


2.2 Statistics, Sufficiency, and Completeness 


Let us assume now that our data set is a realization of a sample X (a 
random vector) from an unknown population P on a probability space. 


2.2.1 Statistics and their distributions 


A measurable function of X, T(X), is called a statistic if T(X) is a known 
value whenever X is known, i.e., the function T is a known function. Sta- 
tistical analyses are based on various statistics, for various purposes. Of 
course, X itself is a statistic, but it is a trivial statistic. The range of a 
nontrivial statistic T(X) is usually simpler than that of X. For example, 
X may be a random n-vector and T(X) may be a random p-vector with a 
p much smaller than n. This is desired since T(X) simplifies the original 
data. 


From a probabilistic point of view, the "information" within the statistic
$T(X)$ concerning the unknown distribution of $X$ is contained in the $\sigma$-field
$\sigma(T(X))$. To see this, assume that $S$ is any other statistic for which
$\sigma(S(X)) = \sigma(T(X))$. Then, by Lemma 1.2, $S$ is a measurable function of
$T$, and $T$ is a measurable function of $S$. Thus, once the value of $S$ (or $T$) is
known, so is the value of $T$ (or $S$). That is, it is not the particular values
of a statistic that contain the information, but the generated $\sigma$-field of the
statistic. Values of a statistic may be important for other reasons.

Note that $\sigma(T(X)) \subset \sigma(X)$ and the two $\sigma$-fields are the same if and
only if $T$ is one-to-one. Usually $\sigma(T(X))$ simplifies $\sigma(X)$, i.e., a statistic
provides a "reduction" of the $\sigma$-field.




Any $T(X)$ is a random element. If the distribution of $X$ is unknown,
then the distribution of $T$ may also be unknown, although $T$ is a known
function. Finding the form of the distribution of $T$ is one of the major
problems in statistical inference and decision theory. Since $T$ is a
transformation of $X$, tools we learn in Chapter 1 for transformations may be useful
in finding the distribution or an approximation to the distribution of $T(X)$.


Example 2.8. Let $X_1,...,X_n$ be i.i.d. random variables having a common
distribution $P$ and $X = (X_1,...,X_n)$. The sample mean $\bar{X}$ and sample
variance $S^2$ defined in (2.1) and (2.2), respectively, are two commonly used
statistics. Can we find the joint or the marginal distributions of $\bar{X}$ and $S^2$?
It depends on how much we know about $P$.


First, let us consider the moments of $\bar{X}$ and $S^2$. Assume that $P$ has a
finite mean denoted by $\mu$. Then

\[ E\bar{X} = \mu. \]

If $P$ is in a parametric family $\{P_\theta : \theta \in \Theta\}$, then $E\bar{X} = \int x \, dP_\theta = \mu(\theta)$
for some function $\mu(\cdot)$. Even if the form of $\mu$ is known, $\mu(\theta)$ may still be
unknown when $\theta$ is unknown. Assume now that $P$ has a finite variance
denoted by $\sigma^2$. Then

\[ \mathrm{Var}(\bar{X}) = \sigma^2/n, \]

which equals $\sigma^2(\theta)/n$ for some function $\sigma^2(\cdot)$ if $P$ is in a parametric family.
With a finite $\sigma^2 = \mathrm{Var}(X_1)$, we can also obtain that

\[ ES^2 = \sigma^2. \]

With a finite $E|X_1|^3$, we can obtain $E(\bar{X})^3$ and $\mathrm{Cov}(\bar{X}, S^2)$, and with a
finite $E|X_1|^4$, we can obtain $\mathrm{Var}(S^2)$ (exercise).
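The three moment identities above can be checked exactly for a small discrete population by enumerating every possible sample. The following Python sketch does this with rational arithmetic; the three-point population used in the usage note is an arbitrary assumption chosen only to keep the enumeration small.

```python
from fractions import Fraction
from itertools import product

def exact_moments(support, probs, n):
    """Return (E[Xbar], Var(Xbar), E[S^2]) for an i.i.d. sample of size n.

    support: list of population values; probs: matching list of Fractions.
    All support**n samples are enumerated, so keep both numbers small.
    """
    e_mean = e_mean_sq = e_s2 = Fraction(0)
    for idx in product(range(len(support)), repeat=n):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]                 # probability of this sample
        xs = [Fraction(support[i]) for i in idx]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        e_mean += weight * xbar
        e_mean_sq += weight * xbar ** 2
        e_s2 += weight * s2
    return e_mean, e_mean_sq - e_mean ** 2, e_s2
```

For the population with values 0, 1, 3 and probabilities 1/2, 1/4, 1/4 we have $\mu = 1$ and $\sigma^2 = 3/2$, so with $n = 3$ the function should return $\mu$, $\sigma^2/n = 1/2$, and $\sigma^2$.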

Next, consider the distribution of $\bar{X}$. If $P$ is in a parametric family, we
can often find the distribution of $\bar{X}$. See Example 1.20 and some exercises
in §1.6. For example, $\bar{X}$ is $N(\mu,\sigma^2/n)$ if $P$ is $N(\mu,\sigma^2)$; $n\bar{X}$ has the gamma
distribution $\Gamma(n,\theta)$ if $P$ is the exponential distribution $E(0,\theta)$. If $P$ is not
in a parametric family, then it is usually hard to find the exact form of the
distribution of $\bar{X}$. One can, however, use the CLT (§1.5.4) to obtain an
approximation to the distribution of $\bar{X}$. Applying Corollary 1.2 (for the
case of $k = 1$), we obtain that

\[ \sqrt{n}(\bar{X} - \mu) \to_d N(0, \sigma^2) \]

and, by (1.100), the distribution of $\bar{X}$ can be approximated by $N(\mu,\sigma^2/n)$,
where $\mu$ and $\sigma^2$ are the mean and variance of $P$, respectively, and are
assumed to be finite.

Compared to $\bar{X}$, the distribution of $S^2$ is harder to obtain. Assuming
that $P$ is $N(\mu,\sigma^2)$, one can show that $(n-1)S^2/\sigma^2$ has the chi-square
distribution $\chi^2_{n-1}$ (see Example 2.18). An approximate distribution for
$S^2$ can be obtained from the approximate joint distribution of $\bar{X}$ and $S^2$
discussed next.


Under the assumption that $P$ is $N(\mu,\sigma^2)$, it can be shown that $\bar{X}$
and $S^2$ are independent (Example 2.18). Hence, the joint distribution of
$(\bar{X}, S^2)$ is the product of the marginal distributions of $\bar{X}$ and $S^2$ given in the
previous discussion. Without the normality assumption, an approximate
joint distribution can be obtained as follows. Assume again that $\mu = EX_1$,
$\sigma^2 = \mathrm{Var}(X_1)$, and $E|X_1|^4$ are finite. Let $Y_i = (X_i - \mu, (X_i - \mu)^2)$,
$i = 1,...,n$. Then $Y_1,...,Y_n$ are i.i.d. random 2-vectors with $EY_1 = (0, \sigma^2)$ and
variance-covariance matrix

\[ \Sigma = \begin{pmatrix} \sigma^2 & E(X_1-\mu)^3 \\ E(X_1-\mu)^3 & E(X_1-\mu)^4 - \sigma^4 \end{pmatrix}. \]

Note that $\bar{Y} = n^{-1}\sum_{i=1}^n Y_i = (\bar{X}-\mu, \tilde{S}^2)$, where $\tilde{S}^2 = n^{-1}\sum_{i=1}^n (X_i-\mu)^2$.
Applying the CLT (Corollary 1.2) to $Y_i$'s, we obtain that

\[ \sqrt{n}(\bar{X}-\mu, \tilde{S}^2-\sigma^2) \to_d N_2(0, \Sigma). \]

Since

\[ S^2 = \frac{n}{n-1}\left[\tilde{S}^2 - (\bar{X}-\mu)^2\right] \]

and $\bar{X} \to_{a.s.} \mu$ (the SLLN, Theorem 1.13), an application of Slutsky's
theorem (Theorem 1.11) leads to

\[ \sqrt{n}(\bar{X}-\mu, S^2-\sigma^2) \to_d N_2(0, \Sigma). \] ∎
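The algebraic identity linking $S^2$ to $\tilde{S}^2 = n^{-1}\sum_i (X_i-\mu)^2$, which drives the Slutsky step, can be verified exactly on any data set. The sketch below uses rational arithmetic so that no rounding is involved; the function names and the sample values are illustrative assumptions.

```python
from fractions import Fraction

def sample_variance(xs):
    """S^2 with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def variance_via_centering(xs, mu):
    """n/(n-1) * [S~^2 - (Xbar - mu)^2], where S~^2 centers at mu instead of Xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    s_tilde2 = sum((x - mu) ** 2 for x in xs) / n
    return Fraction(n, n - 1) * (s_tilde2 - (xbar - mu) ** 2)
```

The identity holds for every value of `mu`, not only the true mean, because $\sum_i (x_i-\mu)^2 = \sum_i (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2$.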


Example 2.9 (Order statistics). Let $X = (X_1,...,X_n)$ with i.i.d. random
components and let $X_{(i)}$ be the $i$th smallest value of $X_1,...,X_n$. The
statistics $X_{(1)},...,X_{(n)}$ are called the order statistics, which is a set of very useful
statistics in addition to the sample mean and variance in the previous
example. Suppose that $X_i$ has a c.d.f. $F$ having a Lebesgue p.d.f. $f$. Then
the joint Lebesgue p.d.f. of $X_{(1)},...,X_{(n)}$ is

\[ g(x_1,x_2,...,x_n) = \begin{cases} n!\, f(x_1)f(x_2)\cdots f(x_n) & x_1 < x_2 < \cdots < x_n \\ 0 & \text{otherwise.} \end{cases} \]

The joint Lebesgue p.d.f. of $X_{(i)}$ and $X_{(j)}$, $1 \le i < j \le n$, is

\[ g_{i,j}(x,y) = \begin{cases} \dfrac{n!\,[F(x)]^{i-1}[F(y)-F(x)]^{j-i-1}[1-F(y)]^{n-j} f(x)f(y)}{(i-1)!\,(j-i-1)!\,(n-j)!} & x < y \\ 0 & \text{otherwise} \end{cases} \]

and the Lebesgue p.d.f. of $X_{(i)}$ is

\[ g_i(x) = \frac{n!}{(i-1)!\,(n-i)!}[F(x)]^{i-1}[1-F(x)]^{n-i} f(x). \] ∎
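As a numerical illustration of the marginal p.d.f. of $X_{(i)}$, the following Python sketch evaluates $g_i$ for a user-supplied $F$ and $f$ and integrates it with a simple midpoint rule. The midpoint integrator and the choice of $U(0,1)$ in the usage note are assumptions made for the example; for $U(0,1)$ the density is that of a beta distribution with mean $i/(n+1)$, a standard fact that is not stated in the text.

```python
import math

def order_stat_pdf(x, i, n, F, f):
    """Lebesgue p.d.f. of the i-th order statistic of n i.i.d. draws from F."""
    coef = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    return coef * F(x) ** (i - 1) * (1 - F(x)) ** (n - i) * f(x)

def integrate(g, a, b, m=20000):
    """Midpoint-rule approximation of the integral of g over (a, b)."""
    h = (b - a) / m
    return h * sum(g(a + (k + 0.5) * h) for k in range(m))

# U(0, 1): F(x) = x and f(x) = 1 on (0, 1)
unif_cdf = lambda x: x
unif_pdf = lambda x: 1.0
```

With $n = 5$ and $i = 2$, integrating `order_stat_pdf` over $(0,1)$ should give approximately 1, and integrating $x$ times it should give approximately $1/3$.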




2.2.2 Sufficiency and minimal sufficiency 


Having discussed the reduction of the $\sigma$-field $\sigma(X)$ by using a statistic
$T(X)$, we now ask whether such a reduction results in any loss of
information concerning the unknown population. If a statistic $T(X)$ is fully as
informative as the original sample $X$, then statistical analyses can be done
using $T(X)$, which is simpler than $X$. The next concept describes what we
mean by fully informative.


Definition 2.4 (Sufficiency). Let $X$ be a sample from an unknown
population $P \in \mathcal{P}$, where $\mathcal{P}$ is a family of populations. A statistic $T(X)$ is
said to be sufficient for $P \in \mathcal{P}$ (or for $\theta \in \Theta$ when $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a
parametric family) if and only if the conditional distribution of $X$ given $T$
is known (does not depend on $P$ or $\theta$). ∎


Definition 2.4 can be interpreted as follows. Once we observe $X$ and
compute a sufficient statistic $T(X)$, the original data $X$ do not contain any
further information concerning the unknown population $P$ (since its
conditional distribution is unrelated to $P$) and can be discarded. A sufficient
statistic $T(X)$ contains all information about $P$ contained in $X$ (see
Exercise 36 in §3.6 for an interpretation of this from another viewpoint) and
provides a reduction of the data if $T$ is not one-to-one. Thus, one of the
questions raised in Example 2.1 can be answered as follows: it is enough to
just look at $\bar{x}$ and $s^2$ for the problem of measuring $\theta$ if $(\bar{X}, S^2)$ is sufficient
for $P$ (or $\theta$ when $\theta$ is the only unknown parameter).


The concept of sufficiency depends on the given family $\mathcal{P}$. If $T$ is
sufficient for $P \in \mathcal{P}$, then $T$ is also sufficient for $P \in \mathcal{P}_0 \subset \mathcal{P}$ but not necessarily
sufficient for $P \in \mathcal{P}_1 \supset \mathcal{P}$.


Example 2.10. Suppose that $X = (X_1,...,X_n)$ and $X_1,...,X_n$ are i.i.d.
from the binomial distribution with the p.d.f. (w.r.t. the counting measure)

\[ f_\theta(z) = \theta^z(1-\theta)^{1-z} I_{\{0,1\}}(z), \quad z \in \mathcal{R}, \ \theta \in (0,1). \]


For any realization $x$ of $X$, $x$ is a sequence of $n$ ones and zeros. Consider
the statistic $T(X) = \sum_{i=1}^n X_i$, which is the number of ones in $X$. Before
showing that $T$ is sufficient, we can intuitively argue that $T$ contains all
information about $\theta$, since $\theta$ is the probability of an occurrence of a one
in $x$. Given $T = t$ (the number of ones in $x$), what is left in the data set
$x$ is the redundant information about the positions of $t$ ones. Since the
random variables are discrete, it is not difficult to compute the conditional
distribution of $X$ given $T = t$. Note that

\[ P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} \]

and $P(T = t) = \binom{n}{t}\theta^t(1-\theta)^{n-t} I_{\{0,1,...,n\}}(t)$. Let $x_i$ be the $i$th component
of $x$. If $t \ne \sum_{i=1}^n x_i$, then $P(X = x, T = t) = 0$. If $t = \sum_{i=1}^n x_i$, then

\[ P(X = x, T = t) = \prod_{i=1}^n P(X_i = x_i) = \theta^t(1-\theta)^{n-t} \prod_{i=1}^n I_{\{0,1\}}(x_i). \]

Let $B_t = \{(x_1,...,x_n) : x_i = 0,1, \ \sum_{i=1}^n x_i = t\}$. Then

\[ P(X = x \mid T = t) = \frac{1}{\binom{n}{t}} I_{B_t}(x) \]

is a known p.d.f. This shows that $T(X)$ is sufficient for $\theta \in (0,1)$, according
to Definition 2.4 with the family $\{f_\theta : \theta \in (0,1)\}$. ∎
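The conclusion of Example 2.10 can be checked by brute force for small $n$: the sketch below enumerates all 0-1 sequences with a given number of ones and normalizes their joint probabilities. It is only an illustration under the stated model; the function name is hypothetical and `theta` is passed as a `Fraction` so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product

def conditional_given_sum(n, t, theta):
    """Return {x: P(X = x | T = t)} for i.i.d. Bernoulli(theta) components."""
    joint = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == t:
            # P(X = x, T = t) = theta^t (1 - theta)^(n - t) on the set B_t
            joint[x] = theta ** t * (1 - theta) ** (n - t)
    total = sum(joint.values())               # equals P(T = t)
    return {x: p / total for x, p in joint.items()}
```

For $n = 4$ and $t = 2$ the result should be the uniform distribution on the $\binom{4}{2} = 6$ sequences in $B_t$, whatever value of $\theta \in (0,1)$ is supplied.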


Finding a sufficient statistic by means of the definition is not conve- 
nient since it involves guessing a statistic T that might be sufficient and 
computing the conditional distribution of X given T = t. For families of 
populations having p.d.f.’s, a simple way of finding sufficient statistics is to 
use the factorization theorem. We first prove the following lemma. 


Lemma 2.1. If a family $\mathcal{P}$ is dominated by a $\sigma$-finite measure, then $\mathcal{P}$ is
dominated by a probability measure $Q = \sum_{i=1}^\infty c_i P_i$, where $c_i$'s are
nonnegative constants with $\sum_{i=1}^\infty c_i = 1$ and $P_i \in \mathcal{P}$.

Proof. Assume that $\mathcal{P}$ is dominated by a finite measure $\nu$ (the case of
$\sigma$-finite $\nu$ is left as an exercise). Let $\mathcal{P}_0$ be the family of all measures of the
form $\sum_{i=1}^\infty c_i P_i$, where $P_i \in \mathcal{P}$, $c_i \ge 0$, and $\sum_{i=1}^\infty c_i = 1$. Then, it suffices
to show that there is a $Q \in \mathcal{P}_0$ such that $Q(A) = 0$ implies $P(A) = 0$ for all
$P \in \mathcal{P}_0$. Let $\mathcal{C}$ be the class of events $C$ for which there exists $P \in \mathcal{P}_0$ such
that $P(C) > 0$ and $dP/d\nu > 0$ a.e. $\nu$ on $C$. Then there exists a sequence
$\{C_i\} \subset \mathcal{C}$ such that $\nu(C_i) \to \sup_{C \in \mathcal{C}} \nu(C)$. Let $C_0$ be the union of all $C_i$'s
and $Q = \sum_{i=1}^\infty c_i P_i$, where $P_i$ is the probability measure corresponding to
$C_i$. Then $C_0 \in \mathcal{C}$ (exercise). Suppose now that $Q(A) = 0$. Let $P \in \mathcal{P}_0$
and $B = \{x : dP/d\nu > 0\}$. Since $Q(A \cap C_0) = 0$, $\nu(A \cap C_0) = 0$ and
$P(A \cap C_0) = 0$. Then $P(A) = P(A \cap C_0^c \cap B)$. If $P(A \cap C_0^c \cap B) > 0$, then
$\nu(C_0 \cup (A \cap C_0^c \cap B)) > \nu(C_0)$, which contradicts $\nu(C_0) = \sup_{C \in \mathcal{C}} \nu(C)$ since
$A \cap C_0^c \cap B$ and therefore $C_0 \cup (A \cap C_0^c \cap B)$ is in $\mathcal{C}$. Thus, $P(A) = 0$ for all
$P \in \mathcal{P}_0$. ∎


Theorem 2.2 (The factorization theorem). Suppose that $X$ is a sample
from $P \in \mathcal{P}$ and $\mathcal{P}$ is a family of probability measures on $(\mathcal{R}^n, \mathcal{B}^n)$
dominated by a $\sigma$-finite measure $\nu$. Then $T(X)$ is sufficient for $P \in \mathcal{P}$ if and
only if there are nonnegative Borel functions $h$ (which does not depend on
$P$) on $(\mathcal{R}^n, \mathcal{B}^n)$ and $g_P$ (which depends on $P$) on the range of $T$ such that

\[ \frac{dP}{d\nu}(x) = g_P(T(x))\,h(x). \qquad (2.11) \]


Proof. (i) Suppose that $T$ is sufficient for $P \in \mathcal{P}$. Then, for any $A \in \mathcal{B}^n$,
$P(A|T)$ does not depend on $P$. Let $Q$ be the probability measure in Lemma
2.1. By Fubini's theorem and the result in Exercise 35 of §1.6,

\[ Q(A \cap B) = \sum_{j=1}^\infty c_j P_j(A \cap B)
   = \sum_{j=1}^\infty c_j \int_B P(A|T)\,dP_j
   = \int_B \sum_{j=1}^\infty c_j P(A|T)\,dP_j
   = \int_B P(A|T)\,dQ \]

for any $B \in \sigma(T)$. Hence, $P(A|T) = E_Q(I_A|T)$ a.s. $Q$, where $E_Q(I_A|T)$
denotes the conditional expectation of $I_A$ given $T$ w.r.t. $Q$. Let $g_P(T)$ be
the Radon-Nikodym derivative $dP/dQ$ on the space $(\mathcal{R}^n, \sigma(T), Q)$. From
Propositions 1.7 and 1.10,

\[ P(A) = \int P(A|T)\,dP
   = \int E_Q(I_A|T)\,g_P(T)\,dQ
   = \int E_Q[I_A g_P(T)|T]\,dQ
   = \int_A g_P(T)\frac{dQ}{d\nu}\,d\nu \]

for any $A \in \mathcal{B}^n$. Hence, (2.11) holds with $h = dQ/d\nu$.
(ii) Suppose that (2.11) holds. Then

\[ \frac{dP}{dQ} = \frac{dP}{d\nu} \Big/ \sum_{i=1}^\infty c_i \frac{dP_i}{d\nu}
   = g_P(T) \Big/ \sum_{i=1}^\infty c_i g_{P_i}(T) \quad \text{a.s. } Q, \qquad (2.12) \]

where the second equality follows from the result in Exercise 35 of §1.6. Let
$A \in \sigma(X)$ and $P \in \mathcal{P}$. The sufficiency of $T$ follows from

\[ P(A|T) = E_Q(I_A|T) \quad \text{a.s. } P, \qquad (2.13) \]

where $E_Q(I_A|T)$ is given in part (i) of the proof. This is because $E_Q(I_A|T)$
does not vary with $P \in \mathcal{P}$, and result (2.13) and Theorem 1.7 imply that
the conditional distribution of $X$ given $T$ is determined by $E_Q(I_A|T)$,
$A \in \sigma(X)$. By the definition of conditional probability, (2.13) follows from

\[ \int_B I_A\,dP = \int_B E_Q(I_A|T)\,dP \qquad (2.14) \]

for any $B \in \sigma(T)$. Let $B \in \sigma(T)$. By (2.12), $dP/dQ$ is a Borel function of
$T$. Then, by Proposition 1.7(i), Proposition 1.10(vi), and the definition of
the conditional expectation, the right-hand side of (2.14) is equal to

\[ \int_B E_Q(I_A|T)\frac{dP}{dQ}\,dQ
   = \int_B E_Q\left(I_A\frac{dP}{dQ}\,\Big|\,T\right)dQ
   = \int_B I_A\frac{dP}{dQ}\,dQ, \]

which equals the left-hand side of (2.14). This proves (2.14) for any
$B \in \sigma(T)$ and completes the proof. ∎


If $\mathcal{P}$ is an exponential family with p.d.f.'s given by (2.4) and $X(\omega) = \omega$,
then we can apply Theorem 2.2 with $g_\theta(t) = \exp\{[\eta(\theta)]^\tau t - \xi(\theta)\}$ and
conclude that $T$ is a sufficient statistic for $\theta \in \Theta$. In Example 2.10 the joint
distribution of $X$ is in an exponential family with $T(X) = \sum_{i=1}^n X_i$. Hence,
we can conclude that $T$ is sufficient for $\theta \in (0,1)$ without computing the
conditional distribution of $X$ given $T$.
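To make the factorization concrete, here is a minimal Python sketch for one exponential family. It assumes i.i.d. Poisson($\theta$) observations (a family chosen for this illustration, not one treated in the passage above), for which $T(x) = \sum_i x_i$, $\eta(\theta) = \log\theta$, $\xi(\theta) = n\theta$, and $h(x) = 1/\prod_i x_i!$. The function names are hypothetical.

```python
import math

def joint_pmf(xs, theta):
    """Joint p.m.f. of i.i.d. Poisson(theta) observations, computed directly."""
    return math.prod(math.exp(-theta) * theta ** x / math.factorial(x) for x in xs)

def g(t, theta, n):
    """g_theta(t) = exp{eta(theta) * t - xi(theta)}, eta = log(theta), xi = n * theta."""
    return math.exp(math.log(theta) * t - n * theta)

def h(xs):
    """h(x) = 1 / prod(x_i!), which does not involve theta."""
    return 1.0 / math.prod(math.factorial(x) for x in xs)
```

The product `g(sum(xs), theta, len(xs)) * h(xs)` should reproduce `joint_pmf(xs, theta)`, and the ratio of the joint p.m.f. at two samples with the same sum should not change with `theta`.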


Example 2.11 (Truncation families). Let $\phi(x)$ be a positive Borel function
on $(\mathcal{R}, \mathcal{B})$ such that $\int_a^b \phi(x)\,dx < \infty$ for any $a$ and $b$, $-\infty < a < b < \infty$.
Let $\theta = (a,b)$, $\Theta = \{(a,b) \in \mathcal{R}^2 : a < b\}$, and

\[ f_\theta(x) = c(\theta)\phi(x) I_{(a,b)}(x), \]

where $c(\theta) = \left[\int_a^b \phi(x)\,dx\right]^{-1}$. Then $\{f_\theta : \theta \in \Theta\}$, called a truncation
family, is a parametric family dominated by the Lebesgue measure on $\mathcal{R}$.
Let $X_1,...,X_n$ be i.i.d. random variables having the p.d.f. $f_\theta$. Then the
joint p.d.f. of $X = (X_1,...,X_n)$ is

\[ \prod_{i=1}^n f_\theta(x_i) = [c(\theta)]^n I_{(a,\infty)}(x_{(1)}) I_{(-\infty,b)}(x_{(n)}) \prod_{i=1}^n \phi(x_i), \qquad (2.15) \]

where $x_{(i)}$ is the $i$th smallest value of $x_1,...,x_n$. Let $T(X) = (X_{(1)}, X_{(n)})$,
$g_\theta(t_1,t_2) = [c(\theta)]^n I_{(a,\infty)}(t_1) I_{(-\infty,b)}(t_2)$, and $h(x) = \prod_{i=1}^n \phi(x_i)$. By (2.15)
and Theorem 2.2, $T(X)$ is sufficient for $\theta \in \Theta$. ∎


Example 2.12 (Order statistics). Let $X = (X_1,...,X_n)$ and $X_1,...,X_n$ be
i.i.d. random variables having a distribution $P \in \mathcal{P}$, where $\mathcal{P}$ is the family
of distributions on $\mathcal{R}$ having Lebesgue p.d.f.'s. Let $X_{(1)},...,X_{(n)}$ be the
order statistics given in Example 2.9. Note that the joint p.d.f. of $X$ is

\[ f(x_1) \cdots f(x_n) = f(x_{(1)}) \cdots f(x_{(n)}). \]

Hence, $T(X) = (X_{(1)},...,X_{(n)})$ is sufficient for $P \in \mathcal{P}$. The order statistics
can be shown to be sufficient even when $\mathcal{P}$ is not dominated by any $\sigma$-finite
measure, but Theorem 2.2 is not applicable (see Exercise 31 in §2.6). ∎




There are many sufficient statistics for a given family $\mathcal{P}$. In fact, if
$T$ is a sufficient statistic and $T = \psi(S)$, where $\psi$ is measurable and $S$ is
another statistic, then $S$ is sufficient. This is obvious from Theorem 2.2 if
the population has a p.d.f., but it can be proved directly from Definition
2.4 (Exercise 25). For instance, in Example 2.10, $\left(\sum_{i=1}^m X_i, \sum_{i=m+1}^n X_i\right)$
is sufficient for $\theta$, where $m$ is any fixed integer between 1 and $n$. If $T$
is sufficient and $T = \psi(S)$ with a measurable $\psi$ that is not one-to-one,
then $\sigma(T) \subset \sigma(S)$ and $T$ is more useful than $S$, since $T$ provides a further
reduction of the data (or $\sigma$-field) without loss of information. Is there a
sufficient statistic that provides "maximal" reduction of the data?


Before introducing the next concept, we need the following notation. If
a statement holds except for outcomes in an event $A$ satisfying $P(A) = 0$
for all $P \in \mathcal{P}$, then we say that the statement holds a.s. $\mathcal{P}$.


Definition 2.5 (Minimal sufficiency). Let $T$ be a sufficient statistic for
$P \in \mathcal{P}$. $T$ is called a minimal sufficient statistic if and only if, for any other
statistic $S$ sufficient for $P \in \mathcal{P}$, there is a measurable function $\psi$ such that
$T = \psi(S)$ a.s. $\mathcal{P}$. ∎


If both $T$ and $S$ are minimal sufficient statistics, then by definition there
is a one-to-one measurable function $\psi$ such that $T = \psi(S)$ a.s. $\mathcal{P}$. Hence,
the minimal sufficient statistic is unique in the sense that two statistics
that are one-to-one measurable functions of each other can be treated as
one statistic.


Example 2.13. Let $X_1,...,X_n$ be i.i.d. random variables from $P_\theta$, the
uniform distribution $U(\theta,\theta+1)$, $\theta \in \mathcal{R}$. Suppose that $n > 1$. The joint
Lebesgue p.d.f. of $(X_1,...,X_n)$ is

\[ f_\theta(x) = \prod_{i=1}^n I_{(\theta,\theta+1)}(x_i) = I_{(x_{(n)}-1,\,x_{(1)})}(\theta), \quad x = (x_1,...,x_n) \in \mathcal{R}^n, \]

where $x_{(i)}$ denotes the $i$th smallest value of $x_1,...,x_n$. By Theorem 2.2,
$T = (X_{(1)}, X_{(n)})$ is sufficient for $\theta$. Note that

\[ x_{(1)} = \sup\{\theta : f_\theta(x) > 0\} \quad \text{and} \quad x_{(n)} = 1 + \inf\{\theta : f_\theta(x) > 0\}. \]

If $S(X)$ is a statistic sufficient for $\theta$, then by Theorem 2.2, there are Borel
functions $h$ and $g_\theta$ such that $f_\theta(x) = g_\theta(S(x))h(x)$. For $x$ with $h(x) > 0$,

\[ x_{(1)} = \sup\{\theta : g_\theta(S(x)) > 0\} \quad \text{and} \quad x_{(n)} = 1 + \inf\{\theta : g_\theta(S(x)) > 0\}. \]

Hence, there is a measurable function $\psi$ such that $T(x) = \psi(S(x))$ when
$h(x) > 0$. Since $h > 0$ a.s. $\mathcal{P}$, we conclude that $T$ is minimal sufficient. ∎
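The key step of Example 2.13, recovering $x_{(1)}$ and $x_{(n)}$ from the set of $\theta$ at which the joint p.d.f. is positive, can be mimicked numerically. The Python sketch below evaluates the $U(\theta,\theta+1)$ likelihood on a grid of $\theta$ values and reads off the endpoints of its support; the grid search is a crude stand-in for the sup and inf in the text, and the function names are hypothetical.

```python
def likelihood(theta, xs):
    """Joint p.d.f. of i.i.d. U(theta, theta + 1) data as a function of theta."""
    return 1.0 if all(theta < x < theta + 1 for x in xs) else 0.0

def support_endpoints(xs, grid):
    """Smallest and largest grid values of theta with positive likelihood."""
    positive = [th for th in grid if likelihood(th, xs) > 0]
    return min(positive), max(positive)

def recover_statistic(xs, grid):
    """Approximate (x_(1), x_(n)) as (sup, 1 + inf) of the positive-likelihood set."""
    lo, hi = support_endpoints(xs, grid)
    return hi, 1 + lo
```

Up to the grid resolution, `recover_statistic` should return the sample minimum and maximum, and two samples that share the same minimum and maximum should have identical likelihood functions.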




Minimal sufficient statistics exist under weak assumptions, e.g., $\mathcal{P}$
contains distributions on $\mathcal{R}^k$ dominated by a $\sigma$-finite measure (Bahadur, 1957).
The next theorem provides some useful tools for finding minimal sufficient
statistics.


Theorem 2.3. Let $\mathcal{P}$ be a family of distributions on $\mathcal{R}^k$.

(i) Suppose that $\mathcal{P}_0 \subset \mathcal{P}$ and a.s. $\mathcal{P}_0$ implies a.s. $\mathcal{P}$. If $T$ is sufficient for
$P \in \mathcal{P}$ and minimal sufficient for $P \in \mathcal{P}_0$, then $T$ is minimal sufficient for
$P \in \mathcal{P}$.

(ii) Suppose that $\mathcal{P}$ contains p.d.f.'s $f_0, f_1, f_2, ...$, w.r.t. a $\sigma$-finite
measure. Let $f_\infty(x) = \sum_{i=0}^\infty c_i f_i(x)$, where $c_i > 0$ for all $i$ and $\sum_{i=0}^\infty c_i = 1$,
and let $T_i(X) = f_i(x)/f_\infty(x)$ when $f_\infty(x) > 0$, $i = 0,1,2,...$. Then
$T(X) = (T_0, T_1, T_2, ...)$ is minimal sufficient for $P \in \mathcal{P}$. Furthermore, if
$\{x : f_i(x) > 0\} \subset \{x : f_0(x) > 0\}$ for all $i$, then we may replace $f_\infty$ by $f_0$,
in which case $T(X) = (T_1, T_2, ...)$ is minimal sufficient for $P \in \mathcal{P}$.

(iii) Suppose that $\mathcal{P}$ contains p.d.f.'s $f_P$ w.r.t. a $\sigma$-finite measure and that
there exists a sufficient statistic $T(X)$ such that, for any possible values $x$
and $y$ of $X$, $f_P(x) = f_P(y)\phi(x,y)$ for all $P$ implies $T(x) = T(y)$, where $\phi$
is a measurable function. Then $T(X)$ is minimal sufficient for $P \in \mathcal{P}$.
Proof. (i) If $S$ is sufficient for $P \in \mathcal{P}$, then it is also sufficient for $P \in \mathcal{P}_0$
and, therefore, $T = \psi(S)$ a.s. $\mathcal{P}_0$ holds for a measurable function $\psi$. The
result follows from the assumption that a.s. $\mathcal{P}_0$ implies a.s. $\mathcal{P}$.

(ii) Note that $f_\infty > 0$ a.s. $\mathcal{P}$. Let $g_i(T) = T_i$, $i = 0,1,2,...$. Then
$f_i(x) = g_i(T(x)) f_\infty(x)$ a.s. $\mathcal{P}$. By Theorem 2.2, $T$ is sufficient for $P \in \mathcal{P}$.
Suppose that $S(X)$ is another sufficient statistic. By Theorem 2.2, there
are Borel functions $h$ and $\tilde{g}_i$ such that $f_i(x) = \tilde{g}_i(S(x))h(x)$, $i = 0,1,2,...$.
Then $T_i(x) = \tilde{g}_i(S(x)) / \sum_{j=0}^\infty c_j \tilde{g}_j(S(x))$ for $x$'s satisfying $f_\infty(x) > 0$. By
Definition 2.5, $T$ is minimal sufficient for $P \in \mathcal{P}$. The proof for the case
where $f_\infty$ is replaced by $f_0$ is the same.

(iii) From Bahadur (1957), there exists a minimal sufficient statistic $S(X)$.
The result follows if we can show that $T(X) = \psi(S(X))$ a.s. $\mathcal{P}$ for a
measurable function $\psi$. By Theorem 2.2, there are Borel functions $g_P$ and $h$
such that $f_P(x) = g_P(S(x))h(x)$ for all $P$. Let $A = \{x : h(x) = 0\}$. Then
$P(A) = 0$ for all $P$. For $x$ and $y$ such that $S(x) = S(y)$, $x \notin A$ and $y \notin A$,

\[ f_P(x) = g_P(S(x))h(x) = g_P(S(y))h(x)h(y)/h(y) = f_P(y)h(x)/h(y) \]

for all $P$. Hence $T(x) = T(y)$. This shows that there is a function $\psi$
such that $T(x) = \psi(S(x))$ except for $x \in A$. It remains to show that
$\psi$ is measurable. Since $S$ is minimal sufficient, $g(T(X)) = S(X)$ a.s. $\mathcal{P}$
for a measurable function $g$. Hence $g$ is one-to-one and $\psi = g^{-1}$. The
measurability of $\psi$ follows from Theorem 3.9 in Parthasarathy (1967). ∎




Example 2.14. Let $\mathcal{P} = \{f_\theta : \theta \in \Theta\}$ be an exponential family with
p.d.f.'s $f_\theta$ given by (2.4) and $X(\omega) = \omega$. Suppose that there exists $\Theta_0 =
\{\theta_0, \theta_1, ..., \theta_p\} \subset \Theta$ such that the vectors $\eta_i = \eta(\theta_i) - \eta(\theta_0)$, $i = 1,...,p$, are
linearly independent in $\mathcal{R}^p$. (This is true if the family is of full rank.) We
have shown that $T(X)$ is sufficient for $\theta \in \Theta$. We now show that $T$ is in
fact minimal sufficient for $\theta \in \Theta$. Let $\mathcal{P}_0 = \{f_\theta : \theta \in \Theta_0\}$. Note that the
set $\{x : f_\theta(x) > 0\}$ does not depend on $\theta$. It follows from Theorem 2.3(ii)
with $f_\infty = f_{\theta_0}$ that

\[ S(X) = \left(\exp\{\eta_1^\tau T(x) - \xi_1\}, ..., \exp\{\eta_p^\tau T(x) - \xi_p\}\right) \]

is minimal sufficient for $\theta \in \Theta_0$, where $\xi_i = \xi(\theta_i) - \xi(\theta_0)$. Since $\eta_i$'s are
linearly independent, there is a one-to-one measurable function $\psi$ such that
$T(X) = \psi(S(X))$ a.s. $\mathcal{P}_0$. Hence, $T$ is minimal sufficient for $\theta \in \Theta_0$. It
is easy to see that a.s. $\mathcal{P}_0$ implies a.s. $\mathcal{P}$. Thus, by Theorem 2.3(i), $T$ is
minimal sufficient for $\theta \in \Theta$. ∎


The results in Examples 2.13 and 2.14 can also be proved by using 
Theorem 2.3(iii) (Exercise 32). 


The sufficiency (and minimal sufficiency) depends on the postulated
family $\mathcal{P}$ of populations (statistical models). Hence, it may not be a useful
concept if the proposed statistical model is wrong or at least one has some
doubts about the correctness of the proposed model. From the examples
in this section and some exercises in §2.6, one can find that for a wide
variety of models, statistics such as $\bar{X}$ in (2.1), $S^2$ in (2.2), $(X_{(1)}, X_{(n)})$ in
Example 2.11, and the order statistics in Example 2.9 are sufficient. Thus,
using these statistics for data reduction and summarization does not lose
any information when the true model is one of those models but we do not
know exactly which model is correct.


2.2.3 Complete statistics 


A statistic $V(X)$ is said to be ancillary if its distribution does not depend
on the population $P$ and first-order ancillary if $E[V(X)]$ is independent
of $P$. A trivial ancillary statistic is the constant statistic $V(X) \equiv c \in
\mathcal{R}$. If $V(X)$ is a nontrivial ancillary statistic, then $\sigma(V(X)) \subset \sigma(X)$ is a
nontrivial $\sigma$-field that does not contain any information about $P$. Hence,
if $S(X)$ is a statistic and $V(S(X))$ is a nontrivial ancillary statistic, it
indicates that $\sigma(S(X))$ contains a nontrivial $\sigma$-field that does not contain
any information about $P$ and, hence, the "data" $S(X)$ may be further
reduced. A sufficient statistic $T$ appears to be most successful in reducing
the data if no nonconstant function of $T$ is ancillary or even first-order
ancillary. This leads to the following concept of completeness.




Definition 2.6 (Completeness). A statistic $T(X)$ is said to be complete
for $P \in \mathcal{P}$ if and only if, for any Borel $f$, $E[f(T)] = 0$ for all $P \in \mathcal{P}$ implies
$f(T) = 0$ a.s. $\mathcal{P}$. $T$ is said to be boundedly complete if and only if the
previous statement holds for any bounded Borel $f$.


A complete statistic is boundedly complete. If $T$ is complete (or
boundedly complete) and $S = \psi(T)$ for a measurable $\psi$, then $S$ is complete (or
boundedly complete). Intuitively, a complete and sufficient statistic should
be minimal sufficient, which was shown by Lehmann and Scheffé (1950) and
Bahadur (1957) (see Exercise 48). However, a minimal sufficient statistic
is not necessarily complete; for example, the minimal sufficient statistic
$(X_{(1)}, X_{(n)})$ in Example 2.13 is not complete (Exercise 47).
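One way to see why $(X_{(1)}, X_{(n)})$ fails to be complete for $U(\theta,\theta+1)$ is that the expected range $E[X_{(n)} - X_{(1)}]$ does not involve $\theta$, so the range minus this constant is a nonzero function of the statistic with zero mean for every $\theta$. The Python sketch below checks this numerically using the p.d.f.'s of $X_{(1)}$ and $X_{(n)}$ from Example 2.9. The midpoint-rule integration, the function name, and the constant $(n-1)/(n+1)$ (a standard fact about uniform order statistics that is not derived in the text) are assumptions of this illustration, not the intended solution of Exercise 47.

```python
def expected_range(theta, n, m=20000):
    """Approximate E[X_(n) - X_(1)] for n i.i.d. U(theta, theta + 1) variables."""
    h = 1.0 / m
    e_min = e_max = 0.0
    for k in range(m):
        u = (k + 0.5) * h                          # u = F(x) = x - theta on (0, 1)
        x = theta + u
        e_min += x * n * (1 - u) ** (n - 1) * h    # p.d.f. of X_(1): n [1 - F]^(n-1) f
        e_max += x * n * u ** (n - 1) * h          # p.d.f. of X_(n): n F^(n-1) f
    return e_max - e_min
```

For $n = 4$ the function should return approximately $3/5$ for every value of `theta`.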


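Proposition 2.1. If $P$ is in an exponential family of full rank with p.d.f.'s
given by (2.6), then $T(X)$ is complete and sufficient for $\eta \in \Xi$.

Proof. We have shown that $T$ is sufficient. Suppose that there is a function
$f$ such that $E[f(T)] = 0$ for all $\eta \in \Xi$. By Theorem 2.1(i),

\[ \int f(t) \exp\{\eta^\tau t - \zeta(\eta)\}\,d\lambda = 0 \quad \text{for all } \eta \in \Xi, \]

where $\lambda$ is a measure on $(\mathcal{R}^p, \mathcal{B}^p)$. Let $\eta_0$ be an interior point of $\Xi$. Then

\[ \int f_+(t)e^{\eta^\tau t}\,d\lambda = \int f_-(t)e^{\eta^\tau t}\,d\lambda \quad \text{for all } \eta \in N(\eta_0), \qquad (2.16) \]

where $N(\eta_0) = \{\eta \in \mathcal{R}^p : \|\eta - \eta_0\| < \epsilon\}$ for some $\epsilon > 0$. In particular,

\[ \int f_+(t)e^{\eta_0^\tau t}\,d\lambda = \int f_-(t)e^{\eta_0^\tau t}\,d\lambda = c. \]

If $c = 0$, then $f = 0$ a.e. $\lambda$. If $c > 0$, then $c^{-1}f_+(t)e^{\eta_0^\tau t}$ and $c^{-1}f_-(t)e^{\eta_0^\tau t}$
are p.d.f.'s w.r.t. $\lambda$ and (2.16) implies that their m.g.f.'s are the same in a
neighborhood of 0. By Theorem 1.6(ii), $c^{-1}f_+(t)e^{\eta_0^\tau t} = c^{-1}f_-(t)e^{\eta_0^\tau t}$, i.e.,
$f = f_+ - f_- = 0$ a.e. $\lambda$. Hence $T$ is complete. ∎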


Proposition 2.1 is useful for finding a complete and sufficient statistic 
when the family of distributions is an exponential family of full rank. 


Example 2.15. Suppose that $X_1,...,X_n$ are i.i.d. random variables having
the $N(\mu,\sigma^2)$ distribution, $\mu \in \mathcal{R}$, $\sigma > 0$. From Example 2.6, the joint p.d.f.
of $X_1,...,X_n$ is $(2\pi)^{-n/2}\exp\{\eta_1 T_1 + \eta_2 T_2 - n\zeta(\eta)\}$, where $T_1 = \sum_{i=1}^n X_i$,
$T_2 = -\sum_{i=1}^n X_i^2$, and $\eta = (\eta_1,\eta_2) = \left(\frac{\mu}{\sigma^2}, \frac{1}{2\sigma^2}\right)$. Hence, the family of
distributions for $X = (X_1,...,X_n)$ is a natural exponential family of full
rank ($\Xi = \mathcal{R} \times (0,\infty)$). By Proposition 2.1, $T(X) = (T_1,T_2)$ is complete
and sufficient for $\eta$. Since there is a one-to-one correspondence between $\eta$
and $\theta = (\mu,\sigma^2)$, $T$ is also complete and sufficient for $\theta$. It can be shown that
any one-to-one measurable function of a complete and sufficient statistic
is also complete and sufficient (exercise). Thus, $(\bar{X}, S^2)$ is complete and
sufficient for $\theta$, where $\bar{X}$ and $S^2$ are the sample mean and variance given
by (2.1) and (2.2), respectively. ∎


The following examples show how to find a complete statistic for a non- 
exponential family. 


Example 2.16. Let $X_1,...,X_n$ be i.i.d. random variables from $P_\theta$, the
uniform distribution $U(0,\theta)$, $\theta > 0$. The largest order statistic, $X_{(n)}$, is
complete and sufficient for $\theta \in (0,\infty)$. The sufficiency of $X_{(n)}$ follows from
the fact that the joint Lebesgue p.d.f. of $X_1,...,X_n$ is $\theta^{-n}I_{(0,\theta)}(x_{(n)})$. From
Example 2.9, $X_{(n)}$ has the Lebesgue p.d.f. $(nx^{n-1}/\theta^n)I_{(0,\theta)}(x)$ on $\mathcal{R}$. Let $f$
be a Borel function on $[0,\infty)$ such that $E[f(X_{(n)})] = 0$ for all $\theta > 0$. Then

\[ \int_0^\theta f(x)x^{n-1}\,dx = 0 \quad \text{for all } \theta > 0. \]

Let $G(\theta)$ be the left-hand side of the previous equation. Applying the result
of differentiation of an integral (see, e.g., Royden (1968, §5.3)), we obtain
that $G'(\theta) = f(\theta)\theta^{n-1}$ a.e. $m_+$, where $m_+$ is the Lebesgue measure on
$([0,\infty), \mathcal{B}_{[0,\infty)})$. Since $G(\theta) = 0$ for all $\theta > 0$, $f(\theta)\theta^{n-1} = 0$ a.e. $m_+$ and,
hence, $f(x) = 0$ a.e. $m_+$. Therefore, $X_{(n)}$ is complete and sufficient for
$\theta \in (0,\infty)$. ∎


Example 2.17. In Example 2.12, we showed that the order statistics
$T(X) = (X_{(1)},...,X_{(n)})$ of i.i.d. random variables $X_1,...,X_n$ is sufficient
for $P \in \mathcal{P}$, where $\mathcal{P}$ is the family of distributions on $\mathcal{R}$ having Lebesgue
p.d.f.'s. We now show that $T(X)$ is also complete for $P \in \mathcal{P}$. Let $\mathcal{P}_0$ be
the family of Lebesgue p.d.f.'s of the form

\[ f(x) = C(\theta_1,...,\theta_n)\exp\{-x^{2n} + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n\}, \]

where $\theta_j \in \mathcal{R}$ and $C(\theta_1,...,\theta_n)$ is a normalizing constant such that $\int f(x)\,dx
= 1$. Then $\mathcal{P}_0 \subset \mathcal{P}$ and $\mathcal{P}_0$ is an exponential family of full rank. Note that
the joint distribution of $X = (X_1,...,X_n)$ is also in an exponential family of
full rank. Thus, by Proposition 2.1, $U = (U_1,...,U_n)$ is a complete statistic
for $P \in \mathcal{P}_0$, where $U_j = \sum_{i=1}^n X_i^j$. Since a.s. $\mathcal{P}_0$ implies a.s. $\mathcal{P}$, $U(X)$ is
also complete for $P \in \mathcal{P}$.

The result follows if we can show that there is a one-to-one
correspondence between $T(X)$ and $U(X)$. Let $V_1 = \sum_{i=1}^n X_i$, $V_2 = \sum_{i<j} X_iX_j$,
$V_3 = \sum_{i<j<k} X_iX_jX_k$, ..., $V_n = X_1 \cdots X_n$. From the identities

\[ U_k - V_1U_{k-1} + V_2U_{k-2} - \cdots + (-1)^{k-1}V_{k-1}U_1 + (-1)^k kV_k = 0, \]

$k = 1,...,n$, there is a one-to-one correspondence between $U(X)$ and
$V(X) = (V_1,...,V_n)$. From the identity

\[ (t - X_1)\cdots(t - X_n) = t^n - V_1t^{n-1} + V_2t^{n-2} - \cdots + (-1)^nV_n, \]

there is a one-to-one correspondence between $V(X)$ and $T(X)$. This
completes the proof and, hence, $T(X)$ is sufficient and complete for $P \in \mathcal{P}$. In
fact, both $U(X)$ and $V(X)$ are sufficient and complete for $P \in \mathcal{P}$. ∎
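The passage from the power sums $U$ to the elementary symmetric functions $V$ can be carried out mechanically by solving the identities above for $V_k$, one $k$ at a time. The following Python sketch does this with exact rational arithmetic and compares the result with a direct computation of $V$. The function names are hypothetical and the data in the usage note are arbitrary.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

def power_sums(xs):
    """U_j = sum_i x_i^j for j = 1, ..., n."""
    return [sum(x ** j for x in xs) for j in range(1, len(xs) + 1)]

def elementary_from_power_sums(U):
    """Solve U_k - V_1 U_{k-1} + ... + (-1)^k k V_k = 0 for V_1, ..., V_n."""
    V = [Fraction(1)]                                   # convention V_0 = 1
    for k in range(1, len(U) + 1):
        # s = U_k - V_1 U_{k-1} + ... + (-1)^(k-1) V_{k-1} U_1  (U is 0-indexed)
        s = sum((-1) ** j * V[j] * U[k - j - 1] for j in range(k))
        V.append(Fraction((-1) ** (k - 1)) * s / k)
    return V[1:]

def elementary_direct(xs):
    """V_k = sum of products of x_i over all index sets of size k."""
    return [sum(prod(c) for c in combinations(xs, k)) for k in range(1, len(xs) + 1)]
```

For the data 3, -1, 2, 5 both routes should give $V = (9, 21, -1, -30)$, and reordering the data should leave $U$ and $V$ unchanged, which reflects that they are functions of the order statistics.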


The relationship between an ancillary statistic and a complete and suf- 
ficient statistic is characterized in the following result. 


Theorem 2.4 (Basu's theorem). Let $V$ and $T$ be two statistics of $X$ from
a population $P \in \mathcal{P}$. If $V$ is ancillary and $T$ is boundedly complete and
sufficient for $P \in \mathcal{P}$, then $V$ and $T$ are independent w.r.t. any $P \in \mathcal{P}$.

Proof. Let $B$ be an event on the range of $V$. Since $V$ is ancillary,
$P(V^{-1}(B))$ is a constant. Since $T$ is sufficient, $E[I_B(V)|T]$ is a
function of $T$ (independent of $P$). Since $E\{E[I_B(V)|T] - P(V^{-1}(B))\} = 0$
for all $P \in \mathcal{P}$, $P(V^{-1}(B)|T) = E[I_B(V)|T] = P(V^{-1}(B))$ a.s. $\mathcal{P}$, by the
bounded completeness of $T$. Let $A$ be an event on the range of $T$. Then,

\[ P(T^{-1}(A) \cap V^{-1}(B)) = E\{E[I_A(T)I_B(V)|T]\} = E\{I_A(T)E[I_B(V)|T]\}
   = E\{I_A(T)P(V^{-1}(B))\} = P(T^{-1}(A))P(V^{-1}(B)). \]

Hence $T$ and $V$ are independent w.r.t. any $P \in \mathcal{P}$. ∎


Basu’s theorem is useful in proving the independence of two statistics. 


Example 2.18. Suppose that $X_1,...,X_n$ are i.i.d. random variables having
the $N(\mu,\sigma^2)$ distribution, with $\mu \in \mathcal{R}$ and a known $\sigma > 0$. It can be easily
shown that the family $\{N(\mu,\sigma^2) : \mu \in \mathcal{R}\}$ is an exponential family of full
rank with natural parameter $\eta = \mu/\sigma^2$. By Proposition 2.1, the sample
mean $\bar{X}$ in (2.1) is complete and sufficient for $\eta$ (and $\mu$). Let $S^2$ be the
sample variance given by (2.2). Since $S^2 = (n-1)^{-1}\sum_{i=1}^n (Z_i - \bar{Z})^2$, where
$Z_i = X_i - \mu$ is $N(0,\sigma^2)$ and $\bar{Z} = n^{-1}\sum_{i=1}^n Z_i$, $S^2$ is an ancillary statistic ($\sigma^2$
is known). By Basu's theorem, $\bar{X}$ and $S^2$ are independent w.r.t. $N(\mu,\sigma^2)$
with $\mu \in \mathcal{R}$. Since $\sigma^2$ is arbitrary, $\bar{X}$ and $S^2$ are independent w.r.t. $N(\mu,\sigma^2)$
for any $\mu \in \mathcal{R}$ and $\sigma^2 > 0$.

Using the independence of $\bar{X}$ and $S^2$, we now show that $(n-1)S^2/\sigma^2$
has the chi-square distribution $\chi^2_{n-1}$. Note that

\[ \frac{n(\bar{X}-\mu)^2}{\sigma^2} + \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i-\mu)^2}{\sigma^2}. \]

From the properties of the normal distributions, $n(\bar{X}-\mu)^2/\sigma^2$ has the
chi-square distribution $\chi^2_1$ with the m.g.f. $(1-2t)^{-1/2}$ and $\sum_{i=1}^n (X_i-\mu)^2/\sigma^2$
has the chi-square distribution $\chi^2_n$ with the m.g.f. $(1-2t)^{-n/2}$, $t < 1/2$. By
the independence of $\bar{X}$ and $S^2$, the m.g.f. of $(n-1)S^2/\sigma^2$ is

\[ (1-2t)^{-n/2}/(1-2t)^{-1/2} = (1-2t)^{-(n-1)/2} \]

for $t < 1/2$. This is the m.g.f. of the chi-square distribution $\chi^2_{n-1}$ and,
therefore, the result follows. ∎
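A small Monte Carlo experiment gives a rough empirical picture of both conclusions of Example 2.18. The Python sketch below simulates normal samples and records $\bar{X}$ and $(n-1)S^2/\sigma^2$. The seed, sample size, and number of replications are arbitrary assumptions, and a near-zero sample correlation is only a necessary symptom of independence, not a proof of it. The comparison values $n-1$ and $2(n-1)$ are the mean and variance of $\chi^2_{n-1}$.

```python
import math
import random

def simulate(n, mu, sigma, reps, seed=0):
    """Return lists of Xbar and (n - 1) S^2 / sigma^2 over `reps` normal samples."""
    rng = random.Random(seed)
    means, scaled_vars = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        means.append(xbar)
        scaled_vars.append((n - 1) * s2 / sigma ** 2)
    return means, scaled_vars

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def correlation(u, v):
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu_u) ** 2 for a in u) * sum((b - mu_v) ** 2 for b in v))
```

With $n = 5$, the simulated values of $(n-1)S^2/\sigma^2$ should have mean near 4 and variance near 8, and their sample correlation with $\bar{X}$ should be close to 0.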


2.3 Statistical Decision Theory 


In this section, we describe some basic elements in statistical decision the- 
ory. More developments are given in later chapters. 


2.3.1 Decision rules, loss functions, and risks 


Let $X$ be a sample from a population $P \in \mathcal{P}$. A statistical decision is an
action that we take after we observe $X$, for example, a conclusion about $P$
or a characteristic of $P$. Throughout this section, we use $\mathbb{A}$ to denote the
set of allowable actions. Let $\mathcal{F}_{\mathbb{A}}$ be a $\sigma$-field on $\mathbb{A}$. Then the measurable
space $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$ is called the action space. Let $\mathcal{X}$ be the range of $X$ and $\mathcal{F}_{\mathcal{X}}$
be a $\sigma$-field on $\mathcal{X}$. A decision rule is a measurable function (a statistic) $T$
from $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ to $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$. If a decision rule $T$ is chosen, then we take the
action $T(X) \in \mathbb{A}$ once $X$ is observed.


The construction or selection of decision rules cannot be done without
any criterion about the performance of decision rules. In statistical decision
theory, we set a criterion using a loss function $L$, which is a function from
$\mathcal{P} \times \mathbb{A}$ to $[0,\infty)$ and is Borel on $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$ for each fixed $P \in \mathcal{P}$. If $X = x$ is
observed and our decision rule is $T$, then our "loss" (in making a decision)
is $L(P,T(x))$. The average loss for the decision rule $T$, which is called the
risk of $T$, is defined to be

\[ R_T(P) = E[L(P,T(X))] = \int_{\mathcal{X}} L(P,T(x))\,dP_X(x). \qquad (2.17) \]

The loss and risk functions are denoted by $L(\theta,a)$ and $R_T(\theta)$ if $\mathcal{P}$ is a
parametric family indexed by $\theta$. A decision rule with small loss is preferred.
But it is difficult to compare $L(P,T_1(X))$ and $L(P,T_2(X))$ for two decision
rules, $T_1$ and $T_2$, since both of them are random. For this reason, the
risk function (2.17) is introduced and we compare two decision rules by
comparing their risks. A rule $T_1$ is as good as another rule $T_2$ if and only if

\[ R_{T_1}(P) \le R_{T_2}(P) \quad \text{for any } P \in \mathcal{P}, \qquad (2.18) \]

and is better than $T_2$ if and only if (2.18) holds and $R_{T_1}(P) < R_{T_2}(P)$ for
at least one $P \in \mathcal{P}$. Two decision rules $T_1$ and $T_2$ are equivalent if and only
if $R_{T_1}(P) = R_{T_2}(P)$ for all $P \in \mathcal{P}$. If there is a decision rule $T_*$ that is as
good as any other rule in $\Im$, a class of allowable decision rules, then $T_*$ is
said to be $\Im$-optimal (or optimal if $\Im$ contains all possible rules).


Example 2.19. Consider the measurement problem in Example 2.1. Suppose
that we need a decision on the value of $\theta \in \mathcal{R}$, based on the sample
$X = (X_1, ..., X_n)$. If $\Theta$ is all possible values of $\theta$, then it is reasonable to
consider the action space $(\mathbb{A}, \mathcal{F}_{\mathbb{A}}) = (\Theta, \mathcal{B}_\Theta)$. An example of a decision rule
is $T(X) = \bar{X}$, the sample mean defined by (2.1). A common loss function
in this problem is the squared error loss $L(P, a) = (\theta - a)^2$, $a \in \mathbb{A}$. Then
the loss for the decision rule $\bar{X}$ is the squared deviation between $\bar{X}$ and $\theta$.
Assuming that the population has mean $\mu$ and variance $\sigma^2 < \infty$, we obtain
the following risk function for $\bar{X}$:

$$
\begin{aligned}
R_{\bar{X}}(P) &= E(\theta - \bar{X})^2 \\
&= (\theta - E\bar{X})^2 + E(E\bar{X} - \bar{X})^2 \\
&= (\theta - E\bar{X})^2 + \mathrm{Var}(\bar{X}) \qquad (2.19) \\
&= (\mu - \theta)^2 + \sigma^2/n, \qquad (2.20)
\end{aligned}
$$

where result (2.20) follows from the results for the moments of $\bar{X}$ in
Example 2.8. If $\theta$ is in fact the mean of the population, then the first term on
the right-hand side of (2.20) is 0 and the risk is an increasing function of
the population variance $\sigma^2$ and a decreasing function of the sample size $n$.


Consider another decision rule $T_1(X) = (X_{(1)} + X_{(n)})/2$. However,
$R_{T_1}(P)$ does not have an explicit form if there is no further assumption on
the population $P$. Suppose that $P \in \mathcal{P}$. Then, for some $\mathcal{P}$, $\bar{X}$ (or $T_1$) is
better than $T_1$ (or $\bar{X}$) (exercise), whereas for some $\mathcal{P}$, neither $\bar{X}$ nor $T_1$ is
better than the other.
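The comparison between $\bar{X}$ and $T_1$ can be explored numerically. The following is a minimal Monte Carlo sketch, not part of the text: the two populations (uniform on $(0,1)$ and $N(0.5, 1)$), the sample size, the seed, and the helper names `mc_risk`, `sample_mean`, and `midrange` are all assumptions chosen for illustration. Under these assumptions the midrange tends to have the smaller squared error risk for the uniform population and the sample mean the smaller risk for the normal population.

```python
import random
import statistics

def mc_risk(rule, sampler, theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of E[(theta - rule(X))^2] for samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [sampler(rng) for _ in range(n)]
        total += (theta - rule(x)) ** 2
    return total / reps

def sample_mean(x):
    return statistics.fmean(x)

def midrange(x):
    # T_1(X) = (X_(1) + X_(n)) / 2
    return (min(x) + max(x)) / 2

n = 10
unif = lambda rng: rng.random()          # assumed population: uniform(0, 1)
norm = lambda rng: rng.gauss(0.5, 1.0)   # assumed population: N(0.5, 1)

# theta = 0.5 is the population mean in both cases.
risk_mean_unif = mc_risk(sample_mean, unif, 0.5, n)
risk_mid_unif = mc_risk(midrange, unif, 0.5, n)
risk_mean_norm = mc_risk(sample_mean, norm, 0.5, n)
risk_mid_norm = mc_risk(midrange, norm, 0.5, n)
```

For the normal case the estimate for the sample mean should be close to $\sigma^2/n = 0.1$, in line with (2.20).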


A different loss function may also be considered. For example, $L(P, a) =
|\theta - a|$, which is called the absolute error loss. However, $R_{\bar{X}}(P)$ and $R_{T_1}(P)$
do not have explicit forms unless $P$ is of some specific form.


The problem in Example 2.19 is a special case of a general problem called
estimation, in which the action space is the set of all possible values of a
population characteristic $\vartheta$ to be estimated. In an estimation problem, a
decision rule $T$ is called an estimator and result (2.19) holds with $\theta = \vartheta$ and
$\bar{X}$ replaced by any estimator with a finite variance. The following example
describes another type of important problem called hypothesis testing.


Example 2.20. Let $\mathcal{P}$ be a family of distributions, $\mathcal{P}_0 \subset \mathcal{P}$, and $\mathcal{P}_1 =
\{P \in \mathcal{P} : P \notin \mathcal{P}_0\}$. A hypothesis testing problem can be formulated as that
of deciding which of the following two statements is true:

$$ H_0 : P \in \mathcal{P}_0 \quad \text{versus} \quad H_1 : P \in \mathcal{P}_1. \tag{2.21} $$

Here, $H_0$ is called the null hypothesis and $H_1$ is called the alternative
hypothesis. The action space for this problem contains only two elements, i.e.,
$\mathbb{A} = \{0, 1\}$, where 0 is the action of accepting $H_0$ and 1 is the action of
rejecting $H_0$. A decision rule is called a test. Since a test $T(X)$ is a function
from $\mathcal{X}$ to $\{0, 1\}$, $T(X)$ must have the form $I_C(X)$, where $C \in \mathcal{F}_{\mathcal{X}}$ is called
the rejection region or critical region for testing $H_0$ versus $H_1$.


A simple loss function for this problem is the 0-1 loss: $L(P, a) = 0$
if a correct decision is made and 1 if an incorrect decision is made, i.e.,
$L(P, j) = 0$ for $P \in \mathcal{P}_j$ and $L(P, j) = 1$ otherwise, $j = 0, 1$. Under this loss,
the risk is

$$ R_T(P) = \begin{cases} P(T(X) = 1) = P(X \in C) & P \in \mathcal{P}_0 \\ P(T(X) = 0) = P(X \notin C) & P \in \mathcal{P}_1. \end{cases} $$

See Figure 2.2 on page 127 for an example of a graph of $R_T(\theta)$ for some $T$
and $P$ in a parametric family.

The 0-1 loss implies that the losses for the two types of incorrect decisions
(accepting $H_0$ when $P \in \mathcal{P}_1$ and rejecting $H_0$ when $P \in \mathcal{P}_0$) are the same.
In some cases, one might assume unequal losses: $L(P, j) = 0$ for $P \in \mathcal{P}_j$,
$L(P, 0) = c_0$ when $P \in \mathcal{P}_1$, and $L(P, 1) = c_1$ when $P \in \mathcal{P}_0$.


In the following example the decision problem is neither an estimation 
nor a testing problem. Another example is given in Exercise 93 in §2.6. 


Example 2.21. A hazardous toxic waste site requires clean-up when the
true chemical concentration $\theta$ in the contaminated soil is higher than a given
level $\theta_0 > 0$. Because of the limitation in resources, we would like to spend
our money and efforts more in those areas that pose high risk to public
health. In a particular area where soil samples are obtained, we would
like to take one of these three actions: a complete clean-up ($a_1$), a partial
clean-up ($a_2$), and no clean-up ($a_3$). Then $\mathbb{A} = \{a_1, a_2, a_3\}$. Suppose that
the cost for a complete clean-up is $c_1$ and for a partial clean-up is $c_2 < c_1$;
the risk to public health is $c_3(\theta - \theta_0)$ if $\theta > \theta_0$ and 0 if $\theta \le \theta_0$; a complete
clean-up can reduce the toxic concentration to an amount $\le \theta_0$, whereas a
partial clean-up can only reduce a fixed amount of the toxic concentration,
i.e., the chemical concentration becomes $\theta - t$ after a partial clean-up, where
$t$ is a known constant. Then the loss function is given by

$$
\begin{array}{l|ccc}
L(\theta, a) & a_1 & a_2 & a_3 \\ \hline
\theta \le \theta_0 & c_1 & c_2 & 0 \\
\theta_0 < \theta \le \theta_0 + t & c_1 & c_2 & c_3(\theta - \theta_0) \\
\theta > \theta_0 + t & c_1 & c_2 + c_3(\theta - \theta_0 - t) & c_3(\theta - \theta_0)
\end{array}
$$

The risk function can be calculated once the decision rule is specified. We
discuss this example again in Chapter 4.
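The loss table of Example 2.21 can be written as a single function. The sketch below is only an illustration: the function name `cleanup_loss` and the string labels for the actions are assumptions, and the numerical constants in the usage line are made up.

```python
def cleanup_loss(theta, action, theta0, t, c1, c2, c3):
    """Loss L(theta, a) from the table in Example 2.21.

    action is one of "a1" (complete), "a2" (partial), "a3" (no clean-up).
    """
    def health_risk(concentration):
        # c3 * (concentration - theta0) above the threshold, 0 otherwise
        return c3 * max(concentration - theta0, 0.0)

    if action == "a1":
        return c1                           # cost only; concentration ends <= theta0
    if action == "a2":
        return c2 + health_risk(theta - t)  # cost plus any remaining risk
    if action == "a3":
        return health_risk(theta)           # no cost, full risk
    raise ValueError(f"unknown action: {action}")

# Hypothetical constants: theta0 = 2, t = 1, c1 = 10, c2 = 4, c3 = 3.
losses_at_4 = [cleanup_loss(4.0, a, 2.0, 1.0, 10, 4, 3) for a in ("a1", "a2", "a3")]
```

Writing the partial clean-up as "cost plus risk at the reduced concentration $\theta - t$" reproduces the second and third rows of the table without separate cases.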




Sometimes it is useful to consider randomized decision rules. Examples
are given in §2.3.2, Chapters 4 and 6. A randomized decision rule is a
function $\delta$ on $\mathcal{X} \times \mathcal{F}_{\mathbb{A}}$ such that, for every $A \in \mathcal{F}_{\mathbb{A}}$, $\delta(\cdot, A)$ is a Borel function
and, for every $x \in \mathcal{X}$, $\delta(x, \cdot)$ is a probability measure on $(\mathbb{A}, \mathcal{F}_{\mathbb{A}})$. To choose
an action in $\mathbb{A}$ when a randomized rule $\delta$ is used, we need to simulate a
pseudorandom element of $\mathbb{A}$ according to $\delta(x, \cdot)$. Thus, an alternative way
to describe a randomized rule is to specify the method of simulating the
action from $\mathbb{A}$ for each $x \in \mathcal{X}$. If $\mathbb{A}$ is a subset of a Euclidean space, for
example, then the result in Theorem 1.7(ii) can be applied. Also, see §7.2.3.

A nonrandomized decision rule $T$ previously discussed can be viewed
as a special randomized decision rule with $\delta(x, \{a\}) = I_{\{a\}}(T(x))$, $a \in \mathbb{A}$,
$x \in \mathcal{X}$. Another example of a randomized rule is a discrete distribution
$\delta(x, \cdot)$ assigning probability $p_j(x)$ to a nonrandomized decision rule $T_j(x)$,
$j = 1, 2, ...$, in which case the rule $\delta$ can be equivalently defined as a rule
taking value $T_j(x)$ with probability $p_j(x)$. See Exercise 64 for an example.


The loss function for a randomized rule $\delta$ is defined as

$$ L(P, \delta, x) = \int_{\mathbb{A}} L(P, a) \, d\delta(x, a), $$

which reduces to the same loss function we discussed when $\delta$ is a
nonrandomized rule. The risk of a randomized rule $\delta$ is then

$$ R_\delta(P) = E[L(P, \delta, X)] = \int_{\mathcal{X}} \int_{\mathbb{A}} L(P, a) \, d\delta(x, a) \, dP_X(x). \tag{2.22} $$


2.3.2 Admissibility and optimality 


Consider a given decision problem with a given loss L(P, a). 


Definition 2.7 (Admissibility). Let $\Im$ be a class of decision rules
(randomized or nonrandomized). A decision rule $T \in \Im$ is called $\Im$-admissible
(or admissible when $\Im$ contains all possible rules) if and only if there does
not exist any $S \in \Im$ that is better than $T$ (in terms of the risk).


If a decision rule T is inadmissible, then there exists a rule better than T. 
Thus, T should not be used in principle. However, an admissible decision 
rule is not necessarily good. For example, in an estimation problem a silly 
estimator T(X) = a constant may be admissible (Exercise 71). 


The relationship between the admissibility and the optimality defined in
§2.3.1 can be described as follows. If $T_*$ is $\Im$-optimal, then it is $\Im$-admissible;
if $T_*$ is $\Im$-optimal and $T_0$ is $\Im$-admissible, then $T_0$ is also $\Im$-optimal and is
equivalent to $T_*$; if there are two $\Im$-admissible rules that are not equivalent,
then there does not exist any $\Im$-optimal rule.




Suppose that we have a sufficient statistic $T(X)$ for $P \in \mathcal{P}$. Intuitively,
our decision rule should be a function of $T$, based on the discussion in
§2.2.2. This is not true in general, but the following result indicates that
this is true if randomized decision rules are allowed.

Proposition 2.2. Suppose that $\mathbb{A}$ is a subset of $\mathcal{R}^k$. Let $T(X)$ be a
sufficient statistic for $P \in \mathcal{P}$ and let $\delta_0$ be a decision rule. Then

$$ \delta_1(t, A) = E[\delta_0(X, A) \mid T = t], \tag{2.23} $$

which is a randomized decision rule depending only on $T$, is equivalent to
$\delta_0$ if $R_{\delta_0}(P) < \infty$ for any $P \in \mathcal{P}$.

Proof. Note that $\delta_1$ defined by (2.23) is a decision rule since $\delta_1$ does not
depend on the unknown $P$ by the sufficiency of $T$. From (2.22),

$$
\begin{aligned}
R_{\delta_1}(P) &= E\left\{ \int_{\mathbb{A}} L(P, a) \, d\delta_1(X, a) \right\} \\
&= E\left\{ E\left[ \int_{\mathbb{A}} L(P, a) \, d\delta_0(X, a) \,\Big|\, T \right] \right\} \\
&= E\left\{ \int_{\mathbb{A}} L(P, a) \, d\delta_0(X, a) \right\} \\
&= R_{\delta_0}(P),
\end{aligned}
$$

where the proof of the second equality is left to the reader.


Note that Proposition 2.2 does not imply that $\delta_0$ is inadmissible. Also,
if $\delta_0$ is a nonrandomized rule,

$$ \delta_1(t, A) = E[I_A(\delta_0(X)) \mid T = t] = P(\delta_0(X) \in A \mid T = t) $$

is still a randomized rule, unless $\delta_0(X) = h(T(X))$ a.s. $P$ for some Borel
function $h$ (Exercise 75). Hence, Proposition 2.2 does not apply to
situations where randomized rules are not allowed.


The following result tells us when nonrandomized rules are all we need 
and when decision rules that are not functions of sufficient statistics are 
inadmissible. 


Theorem 2.5. Suppose that $\mathbb{A}$ is a convex subset of $\mathcal{R}^k$ and that for any
$P \in \mathcal{P}$, $L(P, a)$ is a convex function of $a$.

(i) Let $\delta$ be a randomized rule satisfying $\int_{\mathbb{A}} \|a\| \, d\delta(x, a) < \infty$ for any
$x \in \mathcal{X}$ and let $T_1(x) = \int_{\mathbb{A}} a \, d\delta(x, a)$. Then $L(P, T_1(x)) \le L(P, \delta, x)$ (or
$L(P, T_1(x)) < L(P, \delta, x)$ if $L$ is strictly convex in $a$) for any $x \in \mathcal{X}$ and $P \in \mathcal{P}$.
(ii) (Rao-Blackwell theorem). Let $T$ be a sufficient statistic for $P \in \mathcal{P}$, $T_0 \in
\mathcal{R}^k$ be a nonrandomized rule satisfying $E\|T_0\| < \infty$, and $T_1 = E[T_0(X) \mid T]$.
Then $R_{T_1}(P) \le R_{T_0}(P)$ for any $P \in \mathcal{P}$. If $L$ is strictly convex in $a$ and $T_0$
is not a function of $T$, then $T_0$ is inadmissible.




The proof of Theorem 2.5 is an application of Jensen’s inequality (1.47) 
and is left to the reader. 
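A small exact computation can make part (ii) concrete. The sketch below is an illustration under assumptions that are not in the text: the data are i.i.d. Bernoulli with success probability $p$, the initial rule is $T_0(X) = X_1$, the sufficient statistic is $T = \sum_i X_i$, and the loss is squared error. The function name `rao_blackwell_bernoulli` is hypothetical. Enumerating $\{0,1\}^n$ gives $E[X_1 \mid T = t]$ and both risks without simulation.

```python
from itertools import product

def rao_blackwell_bernoulli(n, p):
    """Enumerate {0,1}^n to get E[X_1 | sum X = t] and both exact risks."""
    num = [0.0] * (n + 1)   # sum of x_1 * P(x) over {x : sum x = t}
    den = [0.0] * (n + 1)   # P(sum X = t)
    for x in product((0, 1), repeat=n):
        t = sum(x)
        prob = p ** t * (1 - p) ** (n - t)
        num[t] += x[0] * prob
        den[t] += prob
    t1 = [num[t] / den[t] for t in range(n + 1)]    # T_1 as a function of t
    risk_t0 = p * (1 - p) ** 2 + (1 - p) * p ** 2   # E(X_1 - p)^2
    risk_t1 = sum(den[t] * (t1[t] - p) ** 2 for t in range(n + 1))
    return t1, risk_t0, risk_t1

t1, risk_t0, risk_t1 = rao_blackwell_bernoulli(5, 0.3)
```

In this setting the conditional expectation works out to $t/n$, which does not involve $p$, and the risk drops from $p(1-p)$ to $p(1-p)/n$.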


The concept of admissibility helps us to eliminate some decision rules. 
However, usually there are still too many rules left after the elimination 
of some rules according to admissibility and sufficiency. Although one is
typically interested in an $\Im$-optimal rule, frequently it does not exist, if $\Im$ is
either too large or too small. The following examples are illustrations.


Example 2.22. Let $X_1, ..., X_n$ be i.i.d. random variables from a population
$P \in \mathcal{P}$ that is the family of populations having finite mean $\mu$ and variance
$\sigma^2$. Consider the estimation of $\mu$ ($\mathbb{A} = \mathcal{R}$) under the squared error loss. It
can be shown that if we let $\Im$ be the class of all possible estimators, then
there is no $\Im$-optimal rule (exercise). Next, let $\Im_1$ be the class of all linear
functions in $X = (X_1, ..., X_n)$, i.e., $T(X) = \sum_{i=1}^n c_i X_i$ with known $c_i \in \mathcal{R}$,
$i = 1, ..., n$. It follows from (2.19) and the discussion after Example 2.19
that

$$ R_T(P) = \mu^2 \left( \sum_{i=1}^n c_i - 1 \right)^2 + \sigma^2 \sum_{i=1}^n c_i^2. \tag{2.24} $$

We now show that there does not exist $T_* = \sum_{i=1}^n c_i^* X_i$ such that $R_{T_*}(P)
\le R_T(P)$ for any $P \in \mathcal{P}$ and $T \in \Im_1$. If there is such a $T_*$, then $(c_1^*, ..., c_n^*)$
is a minimum of the function of $(c_1, ..., c_n)$ on the right-hand side of (2.24).
Then $c_1^*, ..., c_n^*$ must be the same and equal to $\mu^2/(\sigma^2 + n\mu^2)$, which depends
on $P$. Hence $T_*$ is not a statistic. This shows that there is no $\Im_1$-optimal
rule.

Consider now a subclass $\Im_2 \subset \Im_1$ with $c_i$'s satisfying $\sum_{i=1}^n c_i = 1$. From
(2.24), $R_T(P) = \sigma^2 \sum_{i=1}^n c_i^2$ if $T \in \Im_2$. Minimizing $\sigma^2 \sum_{i=1}^n c_i^2$ subject to
$\sum_{i=1}^n c_i = 1$ leads to an optimal solution of $c_i = n^{-1}$ for all $i$. Thus, the
sample mean $\bar{X}$ is $\Im_2$-optimal.

There may not be any optimal rule if we consider a small class of decision
rules. For example, if $\Im_3$ contains all the rules in $\Im_2$ except $\bar{X}$, then one
can show that there is no $\Im_3$-optimal rule.
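The two minimizations in Example 2.22 are easy to check numerically. The sketch below evaluates the risk (2.24) for a given weight vector and the common weight $\mu^2/(\sigma^2 + n\mu^2)$; the function names and the particular values of $\mu$, $\sigma^2$, and the weights are assumptions made for illustration.

```python
def linear_rule_risk(c, mu, sigma2):
    """Risk (2.24) of T(X) = sum c_i X_i under squared error loss."""
    return mu ** 2 * (sum(c) - 1) ** 2 + sigma2 * sum(ci ** 2 for ci in c)

def best_common_weight(n, mu, sigma2):
    """Minimizer mu^2 / (sigma^2 + n mu^2) of (2.24); it depends on the unknown P."""
    return mu ** 2 / (sigma2 + n * mu ** 2)

n = 4
# The minimizing weight changes with (mu, sigma^2), so it is not a statistic.
w_a = best_common_weight(n, mu=1.0, sigma2=1.0)
w_b = best_common_weight(n, mu=2.0, sigma2=1.0)

# Within the unbiased class (weights summing to 1), equal weights do best.
risk_equal = linear_rule_risk([0.25] * n, mu=1.0, sigma2=1.0)
risk_unequal = linear_rule_risk([0.4, 0.3, 0.2, 0.1], mu=1.0, sigma2=1.0)
```

For these inputs the two minimizing weights are $1/5$ and $4/17$, which illustrates why no single linear rule can be optimal over all of $\mathcal{P}$.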


Example 2.23. Assume that the sample $X$ has the binomial distribution
$Bi(\theta, n)$ with an unknown $\theta \in (0, 1)$ and a fixed integer $n > 1$. Consider the
hypothesis testing problem described in Example 2.20 with $H_0 : \theta \in (0, \theta_0]$
versus $H_1 : \theta \in (\theta_0, 1)$, where $\theta_0 \in (0, 1)$ is a fixed value. Suppose that we
are only interested in the following class of nonrandomized decision rules:
$\Im = \{T_j : j = 0, 1, ..., n-1\}$, where $T_j(X) = I_{\{j+1, ..., n\}}(X)$. From Example
2.20, the risk function for $T_j$ under the 0-1 loss is

$$ R_{T_j}(\theta) = P(X > j) I_{(0, \theta_0]}(\theta) + P(X \le j) I_{(\theta_0, 1)}(\theta). $$

For any integers $k$ and $j$, $0 \le k < j \le n-1$,

$$ R_{T_j}(\theta) - R_{T_k}(\theta) = \begin{cases} -P(k < X \le j) < 0 & 0 < \theta \le \theta_0 \\ P(k < X \le j) > 0 & \theta_0 < \theta < 1. \end{cases} $$

Hence, neither $T_j$ nor $T_k$ is better than the other. This shows that every
$T_j$ is $\Im$-admissible and, thus, there is no $\Im$-optimal rule.
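The crossing of the risk functions in Example 2.23 can be seen directly by evaluating them. This is a minimal sketch; the helper names and the choice $n = 5$, $\theta_0 = 0.5$ are assumptions for illustration.

```python
from math import comb

def binom_pmf(x, n, theta):
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def risk_Tj(j, n, theta, theta0):
    """0-1 loss risk of T_j(X) = 1{X > j} for X ~ Bi(theta, n)."""
    p_reject = sum(binom_pmf(x, n, theta) for x in range(j + 1, n + 1))
    # Under H0 the error is rejecting; under H1 the error is accepting.
    return p_reject if theta <= theta0 else 1 - p_reject

n, theta0 = 5, 0.5
under_h0 = (risk_Tj(1, n, 0.4, theta0), risk_Tj(3, n, 0.4, theta0))
under_h1 = (risk_Tj(1, n, 0.6, theta0), risk_Tj(3, n, 0.6, theta0))
```

With $k = 1$ and $j = 3$, the rule $T_3$ has the smaller risk at $\theta = 0.4$ and the larger risk at $\theta = 0.6$, so neither rule dominates the other.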


In view of the fact that an optimal rule often does not exist, statisticians
adopt the following two approaches to choose a decision rule. The first
approach is to define a class $\Im$ of decision rules that have some desirable
properties (statistical and/or nonstatistical) and then try to find the best
rule in $\Im$. In Example 2.22, for instance, any estimator $T$ in $\Im_2$ has the
property that $T$ is linear in $X$ and $E[T(X)] = \mu$. In a general estimation
problem, we can use the following concept.

Definition 2.8 (Unbiasedness). In an estimation problem, the bias of an
estimator $T(X)$ of a real-valued parameter $\vartheta$ of the unknown population
is defined to be $b_T(P) = E[T(X)] - \vartheta$ (which is denoted by $b_T(\theta)$ when $P$
is in a parametric family indexed by $\theta$). An estimator $T(X)$ is said to be
unbiased for $\vartheta$ if and only if $b_T(P) = 0$ for any $P \in \mathcal{P}$.

Thus, $\Im_2$ in Example 2.22 is the class of unbiased estimators linear in
$X$. In Chapter 3, we discuss how to find an $\Im$-optimal estimator when $\Im$ is
the class of unbiased estimators or unbiased estimators linear in $X$.


Another class of decision rules can be defined after we introduce the 
concept of invariance. 


Definition 2.9. Let $X$ be a sample from $P \in \mathcal{P}$.

(i) A class $\mathcal{G}$ of one-to-one transformations of $X$ is called a group if and
only if $g_i \in \mathcal{G}$ implies $g_1 \circ g_2 \in \mathcal{G}$ and $g_i^{-1} \in \mathcal{G}$.

(ii) We say that $\mathcal{P}$ is invariant under $\mathcal{G}$ if and only if $\bar{g}(P_X) = P_{g(X)}$ is a
one-to-one transformation from $\mathcal{P}$ onto $\mathcal{P}$ for each $g \in \mathcal{G}$.

(iii) A decision problem is said to be invariant if and only if $\mathcal{P}$ is
invariant under $\mathcal{G}$ and the loss $L(P, a)$ is invariant in the sense that, for
every $g \in \mathcal{G}$ and every $a \in \mathbb{A}$, there exists a unique $g(a) \in \mathbb{A}$ such that
$L(P_X, a) = L(P_{g(X)}, g(a))$. (Note that $g(X)$ and $g(a)$ are different
functions in general.)

(iv) A decision rule $T(x)$ is said to be invariant if and only if, for every
$g \in \mathcal{G}$ and every $x \in \mathcal{X}$, $T(g(x)) = g(T(x))$.


Invariance means that our decision is not affected by one-to-one trans- 
formations of data. 


In a problem where the distribution of $X$ is in a location-scale family
$\mathcal{P}$ on $\mathcal{R}^k$, we often consider location-scale transformations of data $X$ of the
form $g(X) = AX + c$, where $c \in \mathcal{C} \subset \mathcal{R}^k$ and $A \in \mathcal{T}$, a class of invertible
$k \times k$ matrices. Assume that if $A_i \in \mathcal{T}$, $i = 1, 2$, then $A_i^{-1} \in \mathcal{T}$ and
$A_1 A_2 \in \mathcal{T}$, and that if $c_i \in \mathcal{C}$, $i = 1, 2$, then $-c_i \in \mathcal{C}$ and $A c_1 + c_2 \in \mathcal{C}$ for
any $A \in \mathcal{T}$. Then the collection of all transformations is a group. A special
case is given in the following example.


Example 2.24. Let $X$ have i.i.d. components from a population in a
location family $\mathcal{P} = \{P_\mu : \mu \in \mathcal{R}\}$. Consider the location transformation
$g_c(X) = X + cJ_k$, where $c \in \mathcal{R}$ and $J_k$ is the $k$-vector whose components are
all equal to 1. The group of transformations is $\mathcal{G} = \{g_c : c \in \mathcal{R}\}$, which is a
location-scale transformation group with $\mathcal{T} = \{I_k\}$ and $\mathcal{C} = \{cJ_k : c \in \mathcal{R}\}$.
$\mathcal{P}$ is invariant under $\mathcal{G}$ with $\bar{g}_c(P_\mu) = P_{\mu+c}$. For estimating $\mu$ under the loss
$L(\mu, a) = L(\mu - a)$, where $L(\cdot)$ is a nonnegative Borel function, the decision
problem is invariant with $g_c(a) = a + c$. A decision rule $T$ is invariant if
and only if $T(x + cJ_k) = T(x) + c$ for every $x \in \mathcal{R}^k$ and $c \in \mathcal{R}$. An example
of an invariant decision rule is $T(x) = l^\tau x$ for some $l \in \mathcal{R}^k$ with $l^\tau J_k = 1$.
Note that $T(x) = l^\tau x$ with $l^\tau J_k = 1$ is in the class $\Im_2$ in Example 2.22.
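The invariance condition $T(x + cJ_k) = T(x) + c$ for linear rules can be checked at individual points. The sketch below is only a spot check at one $x$ and one $c$, not a proof; the helper names, the tolerance, and the weight vectors are assumptions for illustration.

```python
def linear_rule(l):
    """T(x) = l^T x for a fixed weight vector l."""
    return lambda x: sum(li * xi for li, xi in zip(l, x))

def is_location_invariant(rule, x, c, tol=1e-9):
    """Check T(x + c J_k) = T(x) + c at one point x and one shift c."""
    shifted = [xi + c for xi in x]
    return abs(rule(shifted) - (rule(x) + c)) <= tol

x = [1.0, 2.0, 4.0]
ok = is_location_invariant(linear_rule([0.5, 0.3, 0.2]), x, 3.0)    # weights sum to 1
bad = is_location_invariant(linear_rule([0.5, 0.5, 0.5]), x, 3.0)   # weights sum to 1.5
```

When the weights sum to $s$, a shift by $c$ moves $T$ by $sc$, so the check succeeds exactly when $s = 1$ (or $c = 0$).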


In §4.2 and §6.3, we discuss the problem of finding an $\Im$-optimal rule
when $\Im$ is a class of invariant decision rules.


The second approach to finding a good decision rule is to consider some
characteristic $R_T$ of $R_T(P)$, for a given decision rule $T$, and then minimize
$R_T$ over $T \in \Im$. The following are two popular ways to carry out this idea.
The first one is to consider an average of $R_T(P)$ over $P \in \mathcal{P}$:

$$ r_T(\Pi) = \int_{\mathcal{P}} R_T(P) \, d\Pi(P), $$

where $\Pi$ is a known probability measure on $(\mathcal{P}, \mathcal{F}_{\mathcal{P}})$ with an appropriate
$\sigma$-field $\mathcal{F}_{\mathcal{P}}$. $r_T(\Pi)$ is called the Bayes risk of $T$ w.r.t. $\Pi$. If $T_* \in \Im$
and $r_{T_*}(\Pi) \le r_T(\Pi)$ for any $T \in \Im$, then $T_*$ is called an $\Im$-Bayes rule
(or Bayes rule when $\Im$ contains all possible rules) w.r.t. $\Pi$. The second
method is to consider the worst situation, i.e., $\sup_{P \in \mathcal{P}} R_T(P)$. If $T_* \in \Im$
and $\sup_{P \in \mathcal{P}} R_{T_*}(P) \le \sup_{P \in \mathcal{P}} R_T(P)$ for any $T \in \Im$, then $T_*$ is called an
$\Im$-minimax rule (or minimax rule when $\Im$ contains all possible rules). Bayes
and minimax rules are discussed in Chapter 4.


Example 2.25. We usually try to find a Bayes rule or a minimax rule in a
parametric problem where $P = P_\theta$ for a $\theta \in \mathcal{R}^k$. Consider the special case
of $k = 1$ and $L(\theta, a) = (\theta - a)^2$, the squared error loss. Note that

$$ r_T(\Pi) = \int_{\mathcal{R}} E[\theta - T(X)]^2 \, d\Pi(\theta), $$

which is equivalent to $E[\boldsymbol{\theta} - T(X)]^2$, where $\boldsymbol{\theta}$ is a random variable having
the distribution $\Pi$ and, given $\boldsymbol{\theta} = \theta$, the conditional distribution of $X$ is
$P_\theta$. Then, the problem can be viewed as a prediction problem for $\boldsymbol{\theta}$ using
functions of $X$. Using the result in Example 1.22, the best predictor is
$E(\boldsymbol{\theta} \mid X)$, which is the $\Im$-Bayes rule w.r.t. $\Pi$ with $\Im$ being the class of rules
$T(X)$ satisfying $E[T(X)]^2 < \infty$ for any $\theta$.

As a more specific example, let $X = (X_1, ..., X_n)$ with i.i.d. components
having the $N(\mu, \sigma^2)$ distribution with an unknown $\mu = \theta \in \mathcal{R}$ and a known
$\sigma^2$, and let $\Pi$ be the $N(\mu_0, \sigma_0^2)$ distribution with known $\mu_0$ and $\sigma_0^2$. Then
the conditional distribution of $\boldsymbol{\theta}$ given $X = x$ is $N(\mu_*(x), c^2)$ with

$$ \mu_*(x) = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2} \mu_0 + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2} \bar{x} \quad \text{and} \quad c^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2} \tag{2.25} $$

(exercise). The Bayes rule w.r.t. $\Pi$ is $E(\boldsymbol{\theta} \mid X) = \mu_*(X)$.
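In this special case we can show that the sample mean $\bar{X}$ is $\Im$-minimax
with $\Im$ being the collection of all decision rules. For any decision rule $T$,

$$
\begin{aligned}
\sup_{\theta \in \mathcal{R}} R_T(\theta) &\ge \int_{\mathcal{R}} R_T(\theta) \, d\Pi(\theta) \\
&\ge \int_{\mathcal{R}} R_{\mu_*}(\theta) \, d\Pi(\theta) \\
&= E\{[\boldsymbol{\theta} - \mu_*(X)]^2\} \\
&= E\{E\{[\boldsymbol{\theta} - \mu_*(X)]^2 \mid X\}\} \\
&= E(c^2) \\
&= c^2,
\end{aligned}
$$

where $\mu_*(X)$ is the Bayes rule given in (2.25) and $c^2$ is also given in (2.25).
Since this result is true for any $\sigma_0^2 > 0$ and $c^2 \to \sigma^2/n$ as $\sigma_0^2 \to \infty$,

$$ \sup_{\theta \in \mathcal{R}} R_T(\theta) \ge \frac{\sigma^2}{n} = \sup_{\theta \in \mathcal{R}} R_{\bar{X}}(\theta), $$

where the equality holds because the risk of $\bar{X}$ under the squared error loss
is, by (2.20), $\sigma^2/n$ and independent of $\theta = \mu$. Thus, $\bar{X}$ is minimax.

A minimax rule in a general case may be difficult to obtain. It can be
seen that if both $\mu$ and $\sigma^2$ are unknown in the previous discussion, then

$$ \sup_{\theta \in \mathcal{R} \times (0, \infty)} R_{\bar{X}}(\theta) = \infty, \tag{2.26} $$

where $\theta = (\mu, \sigma^2)$. Hence $\bar{X}$ cannot be minimax unless (2.26) holds with
$\bar{X}$ replaced by any decision rule $T$, in which case minimaxity becomes
meaningless.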
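The quantities in (2.25) and the limit used in the minimax argument of Example 2.25 are straightforward to compute. The sketch below is illustrative only; the function name `normal_posterior` and the numerical inputs are assumptions.

```python
def normal_posterior(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior mean mu_*(x) and posterior variance c^2 from (2.25)."""
    denom = n * sigma0_2 + sigma2
    mu_star = sigma2 / denom * mu0 + n * sigma0_2 / denom * xbar
    c2 = sigma0_2 * sigma2 / denom
    return mu_star, c2

# Moderate prior variance: the Bayes rule shrinks xbar toward mu0.
mu_star, c2 = normal_posterior(xbar=2.0, n=4, sigma2=1.0, mu0=0.0, sigma0_2=1.0)

# Very diffuse prior: mu_* approaches xbar and c^2 approaches sigma^2 / n.
mu_flat, c2_flat = normal_posterior(xbar=2.0, n=4, sigma2=1.0, mu0=0.0, sigma0_2=1e9)
```

With $n = 4$ and $\sigma^2 = 1$, the Bayes risk $c^2$ rises from $0.2$ toward $\sigma^2/n = 0.25$ as the prior variance grows, which is the lower bound that the risk of $\bar{X}$ attains.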




2.4 Statistical Inference 


The loss function plays a crucial role in statistical decision theory. Loss 
functions can be obtained from a utility analysis (Berger, 1985), but in 
many problems they have to be determined subjectively. In statistical in- 
ference, we make an inference about the unknown population based on 
the sample X and inference procedures without using any loss function, al- 
though any inference procedure can be cast in decision-theoretic terms as 
a decision rule. 


There are three main types of inference procedures: point estimators, 
hypothesis tests, and confidence sets. 


2.4.1 Point estimators 


The problem of estimating an unknown parameter related to the unknown 
population is introduced in Example 2.19 and the discussion after Example 
2.19 as a special statistical decision problem. In statistical inference, how- 
ever, estimators of parameters are derived based on some principle (such as 
the unbiasedness, invariance, sufficiency, substitution principle, likelihood 
principle, Bayesian principle, etc.), not based on a loss or risk function. 
Since confidence sets are sometimes also called interval estimators or set 
estimators, estimators of parameters are called point estimators. 


In Chapters 3 through 5, we consider how to derive a “good” point esti- 
mator based on some principle. Here we focus on how to assess performance 
of point estimators. 


Let $\vartheta \in \tilde{\Theta} \subset \mathcal{R}$ be a parameter to be estimated, which is a function of
the unknown population $P$ or $\theta$ if $P$ is in a parametric family. An estimator
is a statistic with range $\tilde{\Theta}$. First, one has to realize that any estimator $T(X)$
of $\vartheta$ is subject to an estimation error $T(x) - \vartheta$ when we observe $X = x$.
This is not just because $T(X)$ is random. In some problems $T(x)$ never
equals $\vartheta$. A trivial example is when $T(X)$ has a continuous c.d.f. so that
$P(T(X) = \vartheta) = 0$. As a nontrivial example, let $X_1, ..., X_n$ be i.i.d. binary
random variables (also called Bernoulli variables) with $P(X_i = 1) = p$ and
$P(X_i = 0) = 1 - p$. The sample mean $\bar{X}$ is shown to be a good estimator
of $\vartheta = p$ in later chapters, but $\bar{x}$ never equals $\vartheta$ if $\vartheta$ is not one of $j/n$,
$j = 0, 1, ..., n$. Thus, we cannot assess the performance of $T(X)$ by the
values of $T(x)$ with particular $x$'s and it is also not worthwhile to do so.


The bias $b_T(P)$ and unbiasedness of a point estimator $T(X)$ are defined
in Definition 2.8. Unbiasedness of $T(X)$ means that the mean of $T(X)$ is
equal to $\vartheta$. An unbiased estimator $T(X)$ can be viewed as an estimator
without "systematic" error, since, on the average, it does not overestimate
(i.e., $b_T(P) > 0$) or underestimate (i.e., $b_T(P) < 0$). However, an unbiased
estimator $T(X)$ may have large positive and negative errors $T(x) - \vartheta$, $x \in \mathcal{X}$,
although these errors cancel each other in the calculation of the bias, which
is the average $\int [T(x) - \vartheta] \, dP_X(x)$.

Hence, for an unbiased estimator $T(X)$, it is desired that the values of
$T(x)$ be highly concentrated around $\vartheta$. The variance of $T(X)$ is commonly
used as a measure of the dispersion of $T(X)$. The mean squared error (mse)
of $T(X)$ as an estimator of $\vartheta$ is defined to be

$$ \mathrm{mse}_T(P) = E[T(X) - \vartheta]^2 = [b_T(P)]^2 + \mathrm{Var}(T(X)), \tag{2.27} $$

which is denoted by $\mathrm{mse}_T(\theta)$ if $P$ is in a parametric family. $\mathrm{mse}_T(P)$ is
equal to the variance $\mathrm{Var}(T(X))$ if and only if $T(X)$ is unbiased. Note
that the mse is simply the risk of $T$ in statistical decision theory under the
squared error loss.
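The decomposition (2.27) can be verified exactly in a discrete model. The sketch below assumes $X \sim Bi(p, n)$ with target $\vartheta = p$; the function name `moments_of_estimator` and the two estimators used in the usage lines are assumptions for illustration.

```python
from math import comb

def moments_of_estimator(est, n, p):
    """Exact bias, variance and mse of est(X) for X ~ Bi(p, n), target p."""
    pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
    mean = sum(w * est(x) for x, w in enumerate(pmf))
    var = sum(w * (est(x) - mean) ** 2 for x, w in enumerate(pmf))
    mse = sum(w * (est(x) - p) ** 2 for x, w in enumerate(pmf))
    return mean - p, var, mse

# A biased estimator (X + 1) / (n + 2) and the unbiased sample proportion X / n.
bias_s, var_s, mse_s = moments_of_estimator(lambda x: (x + 1) / 12, 10, 0.3)
bias_u, var_u, mse_u = moments_of_estimator(lambda x: x / 10, 10, 0.3)
```

For the sample proportion the bias is zero and the mse equals the variance $p(1-p)/n = 0.021$; for the biased estimator the mse splits into a nonzero squared bias plus a smaller variance.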


In addition to the variance and the mse, the following are other measures
of dispersion that are often used in point estimation problems. The first one
is the mean absolute error of an estimator $T(X)$ defined to be $E|T(X) - \vartheta|$.
The second one is the probability of falling outside a stated distance of $\vartheta$,
i.e., $P(|T(X) - \vartheta| > \epsilon)$ with a fixed $\epsilon > 0$. Again, these two measures of
dispersion are risk functions in statistical decision theory with loss functions
$|\vartheta - a|$ and $I_{(\epsilon, \infty)}(|\vartheta - a|)$, respectively.

For the bias, variance, mse, and mean absolute error, we have implicitly
assumed that certain moments of $T(X)$ exist. On the other hand, the
dispersion measure $P(|T(X) - \vartheta| > \epsilon)$ depends on the choice of $\epsilon$. It is possible
that some estimators are good in terms of one measure of dispersion, but
not in terms of other measures of dispersion. The mse, which is a function
of bias and variance according to (2.27), is mathematically easy to handle
and, hence, is used the most often in the literature. In this book, we use
the mse to assess and compare point estimators unless otherwise stated.


Examples 2.19 and 2.22 provide some examples of estimators and their 
biases, variances, and mse’s. The following are two more examples. 


Example 2.26. Consider the life-time testing problem in Example 2.2. Let
$X_1, ..., X_n$ be i.i.d. from an unknown c.d.f. $F$. Suppose that the parameter
of interest is $\vartheta = 1 - F(t)$ for a fixed $t > 0$. If $F$ is not in a parametric
family, then a nonparametric estimator of $F(t)$ is the empirical c.d.f.

$$ F_n(t) = \frac{1}{n} \sum_{i=1}^n I_{(-\infty, t]}(X_i), \quad t \in \mathcal{R}. \tag{2.28} $$

Since $I_{(-\infty, t]}(X_1), ..., I_{(-\infty, t]}(X_n)$ are i.i.d. binary random variables with
$P(I_{(-\infty, t]}(X_i) = 1) = F(t)$, the random variable $nF_n(t)$ has the binomial
distribution $Bi(F(t), n)$. Consequently, $F_n(t)$ is an unbiased estimator of
$F(t)$ and $\mathrm{Var}(F_n(t)) = \mathrm{mse}_{F_n(t)}(P) = F(t)[1 - F(t)]/n$. Since any linear
combination of unbiased estimators is unbiased for the same linear
combination of the parameters (by the linearity of expectations), an unbiased
estimator of $\vartheta$ is $U(X) = 1 - F_n(t)$, which has the same variance and mse
as $F_n(t)$.

The estimator $U(X) = 1 - F_n(t)$ can be improved in terms of the
mse if there is further information about $F$. Suppose that $F$ is the c.d.f.
of the exponential distribution $E(0, \theta)$ with an unknown $\theta > 0$. Then
$\vartheta = e^{-t/\theta}$. From §2.2.2, the sample mean $\bar{X}$ is sufficient for $\theta > 0$. Since
the squared error loss is strictly convex, an application of Theorem 2.5(ii)
(Rao-Blackwell theorem) shows that the estimator $T(X) = E[1 - F_n(t) \mid \bar{X}]$,
which is also unbiased, is better than $U(X)$ in terms of the mse. Figure
2.1 shows graphs of the mse's of $U(X)$ and $T(X)$, as functions of $\theta$, in the
special case of $n = 10$, $t = 2$, and $F(x) = (1 - e^{-x/\theta}) I_{(0, \infty)}(x)$.
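The improvement of $T(X)$ over $U(X)$ can be illustrated by simulation. The sketch below relies on an assumption that is not stated in the text: for exponential data, $E[1 - F_n(t) \mid \bar{X}]$ is taken to have the closed form $(1 - t/(n\bar{X}))^{n-1}$ when $n\bar{X} > t$ and $0$ otherwise. The function name `simulate_mse`, the value $\theta = 2$, the number of replications, and the seed are also assumptions.

```python
import math
import random

def simulate_mse(n=10, t=2.0, theta=2.0, reps=20000, seed=1):
    """Monte Carlo mse of U = 1 - F_n(t) and of a Rao-Blackwellized version."""
    rng = random.Random(seed)
    target = math.exp(-t / theta)          # vartheta = 1 - F(t)
    se_u = se_t = 0.0
    for _ in range(reps):
        # expovariate takes the rate, so rate 1/theta gives mean theta
        x = [rng.expovariate(1 / theta) for _ in range(n)]
        u = sum(xi > t for xi in x) / n
        s = sum(x)
        # Assumed closed form of E[U | X-bar] for exponential data.
        tt = (1 - t / s) ** (n - 1) if s > t else 0.0
        se_u += (u - target) ** 2
        se_t += (tt - target) ** 2
    return se_u / reps, se_t / reps

mse_u, mse_t = simulate_mse()
```

Under these assumptions the estimate for $U$ should be near $F(t)[1 - F(t)]/n \approx 0.0233$, and the estimate for the Rao-Blackwellized version should be noticeably smaller.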


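Example 2.27. Consider the sample survey problem in Example 2.3 with a
constant selection probability $p(s)$ and univariate $y_i$. Let $\vartheta = Y = \sum_{i=1}^N y_i$,
the population total. We now show that the estimator $\hat{Y} = \frac{N}{n} \sum_{i \in s} y_i$ is
an unbiased estimator of $Y$. Let $a_i = 1$ if $i \in s$ and $a_i = 0$ otherwise. Thus,
$\hat{Y} = \frac{N}{n} \sum_{i=1}^N a_i y_i$. Since $p(s)$ is constant, $E(a_i) = P(a_i = 1) = n/N$ and

$$ E(\hat{Y}) = E\left( \frac{N}{n} \sum_{i=1}^N a_i y_i \right) = \frac{N}{n} \sum_{i=1}^N y_i E(a_i) = \sum_{i=1}^N y_i = Y. $$

Note that

$$ \mathrm{Var}(a_i) = E(a_i) - [E(a_i)]^2 = \frac{n}{N} \left( 1 - \frac{n}{N} \right) $$

and for $i \ne j$,

$$ \mathrm{Cov}(a_i, a_j) = P(a_i = 1, a_j = 1) - E(a_i)E(a_j) = \frac{n(n-1)}{N(N-1)} - \frac{n^2}{N^2}. $$

Hence,

$$
\begin{aligned}
\mathrm{Var}(\hat{Y}) &= \frac{N^2}{n^2} \mathrm{Var}\left( \sum_{i=1}^N a_i y_i \right) \\
&= \frac{N^2}{n^2} \left[ \sum_{i=1}^N y_i^2 \mathrm{Var}(a_i) + 2 \sum_{1 \le i < j \le N} y_i y_j \mathrm{Cov}(a_i, a_j) \right]
\end{aligned}
$$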
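The unbiasedness of $\hat{Y}$ and the variance expression in terms of $\mathrm{Var}(a_i)$ and $\mathrm{Cov}(a_i, a_j)$ can be checked by brute force on a small population. The sketch below enumerates every size-$n$ subset with equal probability and uses exact rational arithmetic; the function names and the population values are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations

def total_estimator_moments(y, n):
    """Exact mean and variance of Y-hat = (N/n) * sum_{i in s} y_i
    when every size-n subset s is equally likely."""
    N = len(y)
    est = [Fraction(N, n) * sum(y[i] for i in s) for s in combinations(range(N), n)]
    mean = sum(est) / len(est)
    var = sum((e - mean) ** 2 for e in est) / len(est)
    return mean, var

def variance_from_indicators(y, n):
    """Variance of Y-hat built from Var(a_i) and Cov(a_i, a_j)."""
    N = len(y)
    var_a = Fraction(n, N) * (1 - Fraction(n, N))
    cov_a = Fraction(n * (n - 1), N * (N - 1)) - Fraction(n, N) ** 2
    cross = sum(y[i] * y[j] for i, j in combinations(range(N), 2))
    return Fraction(N, n) ** 2 * (sum(v * v for v in y) * var_a + 2 * cross * cov_a)

y = [3, 1, 4, 1, 5]            # hypothetical population values, total 14
mean, var = total_estimator_moments(y, n=2)
```

Because `Fraction` arithmetic is exact, the enumerated mean equals the population total and the two variance computations agree without rounding error.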


Figure 2.1: mse's of $U(X)$ and $T(X)$ in Example 2.26


2.4.2 Hypothesis tests 


The basic elements of a hypothesis testing problem are described in Exam- 
ple 2.20. In statistical inference, tests for a hypothesis are derived based on 
some principles similar to those given in an estimation problem. Chapter 
6 is devoted to deriving tests for various types of hypotheses. Several key 
ideas are discussed here. 


To test the hypotheses $H_0$ versus $H_1$ given in (2.21), there are only two
types of statistical errors we may commit: rejecting $H_0$ when $H_0$ is true
(called the type I error) and accepting $H_0$ when $H_0$ is wrong (called the
type II error). In statistical inference, a test $T$, which is a statistic from $\mathcal{X}$
to $\{0, 1\}$, is assessed by the probabilities of making two types of errors:

$$ \alpha_T(P) = P(T(X) = 1) \quad P \in \mathcal{P}_0 \tag{2.29} $$

and

$$ 1 - \alpha_T(P) = P(T(X) = 0) \quad P \in \mathcal{P}_1, \tag{2.30} $$

which are denoted by $\alpha_T(\theta)$ and $1 - \alpha_T(\theta)$ if $P$ is in a parametric family
indexed by $\theta$. Note that these are risks of $T$ under the 0-1 loss in statistical
decision theory. However, an optimal decision rule (test) does not exist even
for a very simple problem with a very simple class of tests (Example 2.23).
That is, error probabilities in (2.29) and (2.30) cannot be minimized
simultaneously. Furthermore, these two error probabilities cannot be bounded
simultaneously by a fixed $\alpha \in (0, 1)$ when we have a sample of a fixed size.


Therefore, a common approach to finding an "optimal" test is to assign
a small bound $\alpha$ to one of the error probabilities, say $\alpha_T(P)$, $P \in \mathcal{P}_0$, and
then to attempt to minimize the other error probability $1 - \alpha_T(P)$, $P \in \mathcal{P}_1$,
subject to

$$ \sup_{P \in \mathcal{P}_0} \alpha_T(P) \le \alpha. \tag{2.31} $$

The bound $\alpha$ is called the level of significance. The left-hand side of (2.31)
is called the size of the test $T$. Note that the level of significance should
be positive, otherwise no test satisfies (2.31) except the silly test $T(X) \equiv 0$
a.s. $\mathcal{P}$.


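Example 2.28. Let $X_1, ..., X_n$ be i.i.d. from the $N(\mu, \sigma^2)$ distribution with
an unknown $\mu \in \mathcal{R}$ and a known $\sigma^2$. Consider the hypotheses

$$ H_0 : \mu \le \mu_0 \quad \text{versus} \quad H_1 : \mu > \mu_0, $$

where $\mu_0$ is a fixed constant. Since the sample mean $\bar{X}$ is sufficient for
$\mu \in \mathcal{R}$, it is reasonable to consider the following class of tests: $T_c(X) =
I_{(c, \infty)}(\bar{X})$, i.e., $H_0$ is rejected (accepted) if $\bar{X} > c$ ($\bar{X} \le c$), where $c \in \mathcal{R}$ is
a fixed constant. Let $\Phi$ be the c.d.f. of $N(0, 1)$. Then, by the property of
the normal distributions,

$$ \alpha_{T_c}(\mu) = P(T_c(X) = 1) = 1 - \Phi\left( \frac{\sqrt{n}(c - \mu)}{\sigma} \right). \tag{2.32} $$

Figure 2.2 provides an example of a graph of two types of error probabilities,
with $\mu_0 = 0$. Since $\Phi(t)$ is an increasing function of $t$,

$$ \sup_{P \in \mathcal{P}_0} \alpha_{T_c}(\mu) = 1 - \Phi\left( \frac{\sqrt{n}(c - \mu_0)}{\sigma} \right). $$

In fact, it is also true that

$$ \sup_{P \in \mathcal{P}_1} [1 - \alpha_{T_c}(\mu)] = \Phi\left( \frac{\sqrt{n}(c - \mu_0)}{\sigma} \right). $$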


If we would like to use an $\alpha$ as the level of significance, then the most
effective way is to choose a $c_\alpha$ (a test $T_{c_\alpha}(X)$) such that

$$ \alpha = \sup_{P \in \mathcal{P}_0} \alpha_{T_{c_\alpha}}(\mu), $$

in which case $c_\alpha$ must satisfy

$$ 1 - \Phi\left( \frac{\sqrt{n}(c_\alpha - \mu_0)}{\sigma} \right) = \alpha, $$

i.e., $c_\alpha = \sigma z_{1-\alpha}/\sqrt{n} + \mu_0$, where $z_a = \Phi^{-1}(a)$. In Chapter 6, it is shown
that for any test $T(X)$ satisfying (2.31),

$$ 1 - \alpha_T(\mu) \ge 1 - \alpha_{T_{c_\alpha}}(\mu), \quad \mu > \mu_0. $$

Figure 2.2: Error probabilities in Example 2.28
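The critical value $c_\alpha$ and the rejection probability (2.32) of Example 2.28 can be computed with the standard library. This is a minimal sketch; the function names and the inputs $n = 25$, $\sigma = 1$, $\mu_0 = 0$, $\alpha = 0.05$ are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal N(0, 1)

def critical_value(alpha, n, sigma, mu0):
    """c_alpha = sigma * z_{1-alpha} / sqrt(n) + mu0."""
    return sigma * PHI.inv_cdf(1 - alpha) / sqrt(n) + mu0

def rejection_prob(c, mu, n, sigma):
    """alpha_{T_c}(mu) = 1 - Phi(sqrt(n) (c - mu) / sigma), as in (2.32)."""
    return 1 - PHI.cdf(sqrt(n) * (c - mu) / sigma)

c = critical_value(alpha=0.05, n=25, sigma=1.0, mu0=0.0)
size = rejection_prob(c, mu=0.0, n=25, sigma=1.0)          # type I error at mu0
type_two = 1 - rejection_prob(c, mu=0.5, n=25, sigma=1.0)  # type II error at mu = 0.5
```

Evaluating `rejection_prob` over a grid of $\mu$ values gives the two error curves on either side of $\mu_0$.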


The choice of a level of significance $\alpha$ is usually somewhat subjective.
In most applications there is no precise limit to the size of $T$ that can be
tolerated. Standard values, such as 0.10, 0.05, or 0.01, are often used for
convenience.

For most tests satisfying (2.31), a small $\alpha$ leads to a "small" rejection
region. It is good practice to determine not only whether $H_0$ is rejected or
accepted for a given $\alpha$ and a chosen test $T_\alpha$, but also the smallest possible
level of significance at which $H_0$ would be rejected for the computed $T_\alpha(x)$,
i.e., $\hat{\alpha} = \inf\{\alpha \in (0, 1) : T_\alpha(x) = 1\}$. Such an $\hat{\alpha}$, which depends on $x$ and
the chosen test and is a statistic, is called the p-value for the test $T_\alpha$.


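Example 2.29. Consider the problem in Example 2.28. Let us calculate
the p-value for $T_{c_\alpha}$. Note that

$$ \alpha = 1 - \Phi\left( \frac{\sqrt{n}(c_\alpha - \mu_0)}{\sigma} \right) > 1 - \Phi\left( \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma} \right) $$

if and only if $\bar{x} > c_\alpha$ (or $T_{c_\alpha}(x) = 1$). Hence

$$ 1 - \Phi\left( \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma} \right) = \inf\{\alpha \in (0, 1) : T_{c_\alpha}(x) = 1\} = \hat{\alpha}(x) $$

is the p-value for $T_{c_\alpha}$. It turns out that $T_{c_\alpha}(x) = I_{(0, \alpha)}(\hat{\alpha}(x))$.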
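The identity $T_{c_\alpha}(x) = I_{(0,\alpha)}(\hat{\alpha}(x))$ says that rejecting when $\bar{x} > c_\alpha$ is the same as rejecting when the p-value is below $\alpha$. The sketch below checks this at a few points; the function names and numerical inputs are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_value(xbar, n, sigma, mu0):
    """alpha-hat(x) = 1 - Phi(sqrt(n) (xbar - mu0) / sigma)."""
    return 1 - NormalDist().cdf(sqrt(n) * (xbar - mu0) / sigma)

def rejects(xbar, alpha, n, sigma, mu0):
    """T_{c_alpha}(x) = 1 exactly when xbar > c_alpha."""
    c_alpha = sigma * NormalDist().inv_cdf(1 - alpha) / sqrt(n) + mu0
    return xbar > c_alpha

# Compare the two descriptions of the test at alpha = 0.05, n = 25, sigma = 1, mu0 = 0.
agreement = [
    rejects(xbar, 0.05, 25, 1.0, 0.0) == (p_value(xbar, 25, 1.0, 0.0) < 0.05)
    for xbar in (-0.3, 0.0, 0.2, 0.4, 1.0)
]
```

Reporting the p-value carries more information than the accept or reject outcome alone, since it shows the outcome for every level $\alpha$ at once.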


With the additional information provided by p-values, using p-values is
typically more appropriate than using fixed-level tests in a scientific
problem. However, a fixed level of significance is unavoidable when acceptance
or rejection of $H_0$ implies an imminent concrete decision. For more
discussions about p-values, see Lehmann (1986) and Weerahandi (1995).

In Example 2.28, the equality in (2.31) can always be achieved by a
suitable choice of $c$. This is, however, not true in general. In Example 2.23,
for instance, it is possible to find an $\alpha$ such that

$$ \sup_{0 < \theta \le \theta_0} P(T_j(X) = 1) \ne \alpha $$

for all $T_j$'s. In such cases, we may consider randomized tests, which are
introduced next.


Recall that a randomized decision rule is a probability measure $\delta(x, \cdot)$
on the action space for any fixed $x$. Since the action space contains only
two points, 0 and 1, for a hypothesis testing problem, any randomized test
$\delta(X, A)$ is equivalent to a statistic $T(X) \in [0, 1]$ with $T(x) = \delta(x, \{1\})$ and
$1 - T(x) = \delta(x, \{0\})$. A nonrandomized test is obviously a special case
where $T(x)$ does not take any value in $(0, 1)$.

For any randomized test $T(X)$, we define the type I error probability
to be $\alpha_T(P) = E[T(X)]$, $P \in \mathcal{P}_0$, and the type II error probability to be
$1 - \alpha_T(P) = E[1 - T(X)]$, $P \in \mathcal{P}_1$. For a class of randomized tests, we
would like to minimize $1 - \alpha_T(P)$ subject to (2.31).


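Example 2.30. Consider Example 2.23 and the following class of
randomized tests:

$$ T_{j,q}(X) = \begin{cases} 1 & X > j \\ q & X = j \\ 0 & X < j, \end{cases} $$

where $j = 0, 1, ..., n-1$ and $q \in [0, 1]$. Then

$$ \alpha_{T_{j,q}}(\theta) = P(X > j) + qP(X = j) \quad 0 < \theta \le \theta_0 $$

and

$$ 1 - \alpha_{T_{j,q}}(\theta) = P(X < j) + (1 - q)P(X = j) \quad \theta_0 < \theta < 1. $$

It can be shown that for any $\alpha \in (0, 1)$, there exist an integer $j$ and $q \in (0, 1)$
such that the size of $T_{j,q}$ is $\alpha$ (exercise).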
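A pair $(j, q)$ achieving a given size can be found by a direct search. The sketch below rests on two assumptions that go beyond the text: the supremum of $\alpha_{T_{j,q}}(\theta)$ over $0 < \theta \le \theta_0$ is taken to be attained at $\theta_0$ (the rejection probability is nondecreasing in $\theta$), and the search allows $j = n$ and boundary values of $q$ so that it always returns an answer, even for very small $\alpha$. The function name `size_alpha_test` is hypothetical.

```python
from math import comb

def size_alpha_test(alpha, n, theta0):
    """Find (j, q) with P(X > j) + q P(X = j) = alpha for X ~ Bi(theta0, n)."""
    pmf = [comb(n, x) * theta0 ** x * (1 - theta0) ** (n - x) for x in range(n + 1)]
    for j in range(n + 1):
        upper = sum(pmf[j + 1:])            # P(X > j)
        if upper <= alpha:
            # Smallest such j, so adding all of P(X = j) would exceed alpha.
            return j, (alpha - upper) / pmf[j]

j, q = size_alpha_test(alpha=0.05, n=10, theta0=0.5)
```

For $n = 10$ and $\theta_0 = 0.5$ no nonrandomized $T_j$ has size exactly $0.05$, since $P(X > 8) = 11/1024$ and $P(X > 7) = 56/1024$; randomizing at $X = 8$ fills the gap.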




2.4.3 Confidence sets 


Let $\vartheta$ be a $k$-vector of unknown parameters related to the
unknown population $P \in \mathcal{P}$ and
$C(X) \in \mathcal{B}^k_{\tilde\Theta}$ depending only on the sample $X$,
where $\tilde\Theta \in \mathcal{B}^k$ is the range of $\vartheta$. If

$$\inf_{P \in \mathcal{P}} P(\vartheta \in C(X)) \ge 1 - \alpha, \tag{2.33}$$

where $\alpha$ is a fixed constant in $(0,1)$, then $C(X)$ is called a
confidence set for $\vartheta$ with level of significance $1 - \alpha$. The
left-hand side of (2.33) is called the confidence coefficient of $C(X)$,
which is the highest possible level of significance for $C(X)$. A
confidence set is a random element that covers the unknown $\vartheta$ with
certain probability. If (2.33) holds, then the coverage probability of
$C(X)$ is at least $1 - \alpha$, although $C(x)$ either covers or does not
cover $\vartheta$ once we observe $X = x$. The concepts of level of
significance and confidence coefficient are very similar to the level of
significance and size in hypothesis testing. In fact, it is shown in
Chapter 7 that some confidence sets are closely related to hypothesis
tests.


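Consider a real-valued $\vartheta$. If
$C(X) = [\underline{\vartheta}(X), \overline{\vartheta}(X)]$ for a pair of
real-valued statistics $\underline{\vartheta}$ and $\overline{\vartheta}$,
then $C(X)$ is called a confidence interval for $\vartheta$. If
$C(X) = (-\infty, \overline{\vartheta}(X)]$ (or
$[\underline{\vartheta}(X), \infty)$), then $\overline{\vartheta}$ (or
$\underline{\vartheta}$) is called an upper (or a lower) confidence bound
for $\vartheta$.

A confidence set (or interval) is also called a set (or an interval)
estimator of $\vartheta$, although it is very different from a point
estimator (discussed in §2.4.1).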


Example 2.31. Consider Example 2.28. Suppose that a confidence interval
for $\vartheta = \mu$ is needed. Again, we only need to consider
$\underline{\vartheta}(\bar X)$ and $\overline{\vartheta}(\bar X)$, since
the sample mean $\bar X$ is sufficient. Consider confidence intervals of
the form $[\bar X - c, \bar X + c]$, where $c \in (0,\infty)$ is fixed.
Note that

$$P\big(\mu \in [\bar X - c, \bar X + c]\big) = P\big(|\bar X - \mu| \le c\big) = 1 - 2\Phi(-\sqrt{n}c/\sigma),$$

which is independent of $\mu$. Hence, the confidence coefficient of
$[\bar X - c, \bar X + c]$ is $1 - 2\Phi(-\sqrt{n}c/\sigma)$, which is an
increasing function of $c$ and converges to 1 as $c \to \infty$ or 0 as
$c \to 0$. Thus, confidence coefficients are positive but less than 1
except for silly confidence intervals $[\bar X, \bar X]$ and
$(-\infty, \infty)$. We can choose a confidence interval with an
arbitrarily large confidence coefficient, but the chosen confidence
interval may be so wide that it is practically useless.

If $\sigma^2$ is also unknown, then $[\bar X - c, \bar X + c]$ has
confidence coefficient 0 and, therefore, is not a good inference
procedure. In such a case a different confidence interval for $\mu$ with
positive confidence coefficient can be derived (Exercise 97 in §2.6).




This example tells us that a reasonable approach is to choose a level of
significance $1 - \alpha \in (0,1)$ (just like the level of significance in
hypothesis testing) and a confidence interval or set satisfying (2.33). In
Example 2.31, when $\sigma^2$ is known and $c$ is chosen to be
$\sigma z_{1-\alpha/2}/\sqrt{n}$, where $z_a = \Phi^{-1}(a)$, the
confidence coefficient of the confidence interval
$[\bar X - c, \bar X + c]$ is exactly $1 - \alpha$ for any fixed
$\alpha \in (0,1)$. This is desirable since, for all confidence intervals
satisfying (2.33), the one with the shortest interval length is preferred.
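The relation between the half-width $c$ and the confidence coefficient in Example 2.31 can be checked numerically. The sketch below uses the standard normal c.d.f. from Python's standard library; the function names `coverage` and `half_width` are hypothetical and only illustrate the two formulas $1 - 2\Phi(-\sqrt{n}c/\sigma)$ and $c = \sigma z_{1-\alpha/2}/\sqrt{n}$ for a known $\sigma$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def coverage(c, n, sigma):
    """Confidence coefficient 1 - 2*Phi(-sqrt(n) c / sigma) of [Xbar - c, Xbar + c]."""
    return 1.0 - 2.0 * STD_NORMAL.cdf(-sqrt(n) * c / sigma)


def half_width(alpha, n, sigma):
    """Half-width c = sigma * z_{1 - alpha/2} / sqrt(n)."""
    return sigma * STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) / sqrt(n)
```

With `c = half_width(0.05, 25, 2.0)`, the value `coverage(c, 25, 2.0)` agrees with 0.95 up to rounding, and `coverage` increases with `c` as stated in the example.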


For a general confidence interval
$[\underline{\vartheta}(X), \overline{\vartheta}(X)]$, its length is
$\overline{\vartheta}(X) - \underline{\vartheta}(X)$, which may be random.
We may consider the expected (or average) length
$E[\overline{\vartheta}(X) - \underline{\vartheta}(X)]$. The confidence
coefficient and expected length are a pair of good measures of performance
of confidence intervals. Like the two types of error probabilities of a
test in hypothesis testing, however, we cannot maximize the confidence
coefficient and minimize the length (or expected length) simultaneously. A
common approach is to minimize the length (or expected length) subject to
(2.33).


For an unbounded confidence interval, its length is $\infty$. Hence we
have to define some other measures of performance. For an upper (or a
lower) confidence bound, we may consider the distance
$\overline{\vartheta}(X) - \vartheta$ (or
$\vartheta - \underline{\vartheta}(X)$) or its expectation.


To conclude this section, we discuss an example of a confidence set for 
a two-dimensional parameter. General discussions about how to construct 
and assess confidence sets are given in Chapter 7. 


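Example 2.32. Let $X_1, \ldots, X_n$ be i.i.d. from the $N(\mu, \sigma^2)$
distribution with both $\mu \in \mathcal{R}$ and $\sigma^2 > 0$ unknown.
Let $\theta = (\mu, \sigma^2)$ and $\alpha \in (0,1)$ be given. Let
$\bar X$ be the sample mean and $S^2$ be the sample variance. Since
$(\bar X, S^2)$ is sufficient (Example 2.15), we focus on $C(X)$ that is a
function of $(\bar X, S^2)$. From Example 2.18, $\bar X$ and $S^2$ are
independent and $(n-1)S^2/\sigma^2$ has the chi-square distribution
$\chi^2_{n-1}$. Since $\sqrt{n}(\bar X - \mu)/\sigma$ has the $N(0,1)$
distribution (Exercise 43 in §1.6),

$$P\left(-\tilde c_\alpha \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le \tilde c_\alpha\right) = \sqrt{1-\alpha},$$

where $\tilde c_\alpha = \Phi^{-1}\left(\frac{1+\sqrt{1-\alpha}}{2}\right)$
(verify). Since the chi-square distribution $\chi^2_{n-1}$ is a known
distribution, we can always find two constants $c_{1\alpha}$ and
$c_{2\alpha}$ such that

$$P\left(c_{1\alpha} \le \frac{(n-1)S^2}{\sigma^2} \le c_{2\alpha}\right) = \sqrt{1-\alpha}.$$

Then, by the independence of $\bar X$ and $S^2$,

$$P\left(-\tilde c_\alpha \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le \tilde c_\alpha,\ c_{1\alpha} \le \frac{(n-1)S^2}{\sigma^2} \le c_{2\alpha}\right) = 1 - \alpha,$$

or

$$P\left(\frac{n(\bar X - \mu)^2}{\tilde c_\alpha^2} \le \sigma^2,\ \frac{(n-1)S^2}{c_{2\alpha}} \le \sigma^2 \le \frac{(n-1)S^2}{c_{1\alpha}}\right) = 1 - \alpha. \tag{2.34}$$

The left-hand side of (2.34) defines a set in the range of
$\theta = (\mu, \sigma^2)$ bounded by two straight lines,
$\sigma^2 = (n-1)S^2/c_{i\alpha}$, $i = 1, 2$, and a curve
$\sigma^2 = n(\bar X - \mu)^2/\tilde c_\alpha^2$ (see the shadowed part of
Figure 2.3). This set is a confidence set for $\theta$ with confidence
coefficient $1 - \alpha$, since (2.34) holds for any $\theta$. ∎

[Figure 2.3: A confidence set for $\theta$ in Example 2.32 (horizontal axis: mean)]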
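The "(verify)" step and the shape of the confidence set can be explored with a short sketch. It assumes only the standard normal quantile function from Python's standard library; the chi-square constants $c_{1\alpha}$ and $c_{2\alpha}$ are not computed here and must be supplied by the reader (the values used in any call are the caller's responsibility). The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def c_tilde(alpha):
    """Normal constant Phi^{-1}((1 + sqrt(1 - alpha)) / 2) of Example 2.32."""
    return STD_NORMAL.inv_cdf((1.0 + sqrt(1.0 - alpha)) / 2.0)


def in_confidence_set(mu, sigma2, xbar, s2, n, alpha, c1, c2):
    """Check whether (mu, sigma2) lies in the set defined by (2.34).

    c1 and c2 are the chi-square constants c_{1 alpha} and c_{2 alpha};
    they are taken as given inputs in this sketch.
    """
    ct = c_tilde(alpha)
    above_curve = n * (xbar - mu) ** 2 / ct**2 <= sigma2
    between_lines = (n - 1) * s2 / c2 <= sigma2 <= (n - 1) * s2 / c1
    return above_curve and between_lines
```

Since $P(-\tilde c_\alpha \le Z \le \tilde c_\alpha) = 2\Phi(\tilde c_\alpha) - 1$ for a standard normal $Z$, the value `2 * STD_NORMAL.cdf(c_tilde(alpha)) - 1` should reproduce $\sqrt{1-\alpha}$.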


2.5 Asymptotic Criteria and Inference 


We have seen that in statistical decision theory and inference, a key to
the success of finding a good decision rule or inference procedure is being
able to find some moments and/or distributions of various statistics.
Although many examples are presented (including those in the exercises in
§2.6), there are more cases in which we are not able to find exactly the
moments or distributions of given statistics, especially when the problem
is not parametric (see, e.g., the discussions in Example 2.8).

In practice, the sample size $n$ is often large, which allows us to
approximate the moments and distributions of statistics that are
impossible to derive, using the asymptotic tools discussed in §1.5. In an
asymptotic analysis, we consider a sample $X = (X_1, \ldots, X_n)$ not for
fixed $n$, but as a member of a sequence corresponding to
$n = n_0, n_0 + 1, \ldots$, and obtain the limit of the distribution of an
appropriately normalized statistic or variable $T_n(X)$ as $n \to \infty$.
The limiting distribution and its moments are used as approximations to
the distribution and moments of $T_n(X)$ in the situation with a large but
actually finite $n$. This leads to some asymptotic statistical procedures
and asymptotic criteria for assessing their performances, which are
introduced in this section.


The asymptotic approach is not only applied to the situation where no 
exact method is available, but also used to provide an inference procedure 
simpler (e.g., in terms of computation) than that produced by the exact 
approach (the approach considering a fixed n). Some examples are given 
in later chapters. 


In addition to providing more theoretical results and/or simpler infer- 
ence procedures, the asymptotic approach requires less stringent mathemat- 
ical assumptions than does the exact approach. The mathematical precision 
of the optimality results obtained in statistical decision theory, for example, 
tends to obscure the fact that these results are approximations in view of 
the approximate nature of the assumed models and loss functions. As the 
sample size increases, the statistical properties become less dependent on 
the loss functions and models. However, a major weakness of the asymp- 
totic approach is that typically no good estimates for the precision of the 
approximations are available and, therefore, we cannot determine whether 
a particular n in a problem is large enough to safely apply the asymptotic 
results. To overcome this difficulty, asymptotic results are frequently used 
in combination with some numerical/empirical studies for selected values 
of n to examine the finite sample performance of asymptotic procedures. 


2.5.1 Consistency 


A reasonable point estimator is expected to perform better, at least on
the average, if more information about the unknown population is
available. With a fixed model assumption and sampling plan, more data
(larger sample size $n$) provide more information about the unknown
population. Thus, it is distasteful to use a point estimator $T_n$ which,
if sampling were to continue indefinitely, could possibly have a nonzero
estimation error, although the estimation error of $T_n$ for a fixed $n$
may never equal 0 (see the discussion in §2.4.1).


Definition 2.10 (Consistency of point estimators). Let
$X = (X_1, \ldots, X_n)$ be a sample from $P \in \mathcal{P}$ and $T_n(X)$
be a point estimator of $\vartheta$ for every $n$.
(i) $T_n(X)$ is called consistent for $\vartheta$ if and only if
$T_n(X) \to_p \vartheta$ w.r.t. any $P \in \mathcal{P}$.
(ii) Let $\{a_n\}$ be a sequence of positive constants diverging to
$\infty$. $T_n(X)$ is called $a_n$-consistent for $\vartheta$ if and only
if $a_n[T_n(X) - \vartheta] = O_p(1)$ w.r.t. any $P \in \mathcal{P}$.
(iii) $T_n(X)$ is called strongly consistent for $\vartheta$ if and only if
$T_n(X) \to_{a.s.} \vartheta$ w.r.t. any $P \in \mathcal{P}$.
(iv) $T_n(X)$ is called $L_r$-consistent for $\vartheta$ if and only if
$T_n(X) \to_{L_r} \vartheta$ w.r.t. any $P \in \mathcal{P}$ for some fixed
$r > 0$. ∎


Consistency is actually a concept relating to a sequence of estimators,
$\{T_n, n = n_0, n_0 + 1, \ldots\}$, but we usually just say "consistency
of $T_n$" for simplicity. Each of the four types of consistency in
Definition 2.10 describes the convergence of $T_n(X)$ to $\vartheta$ in
some sense, as $n \to \infty$. In statistics, consistency according to
Definition 2.10(i), which is sometimes called weak consistency since it is
implied by any of the other three types of consistency, is the most useful
concept of convergence of $T_n$ to $\vartheta$. $L_2$-consistency is also
called consistency in mse, which is the most useful type of
$L_r$-consistency.


Example 2.33. Let $X_1, \ldots, X_n$ be i.i.d. from $P \in \mathcal{P}$. If
$\vartheta = \mu$, which is the mean of $P$ and is assumed to be finite,
then by the SLLN (Theorem 1.13), the sample mean $\bar X$ is strongly
consistent for $\mu$ and, therefore, is also consistent for $\mu$. If we
further assume that the variance of $P$ is finite, then by (2.20),
$\bar X$ is consistent in mse and is $\sqrt{n}$-consistent. With the finite
variance assumption, the sample variance $S^2$ is strongly consistent for
the variance of $P$, according to the SLLN.

Consider estimators of the form $T_n = \sum_{i=1}^n c_{ni}X_i$, where
$\{c_{ni}\}$ is a double array of constants. If $P$ has a finite variance,
then by (2.24), $T_n$ is consistent in mse if and only if
$\sum_{i=1}^n c_{ni} \to 1$ and $\sum_{i=1}^n c_{ni}^2 \to 0$. If we only
assume the existence of the mean of $P$, then $T_n$ with $c_{ni} = c_i/n$
satisfying $n^{-1}\sum_{i=1}^n c_i \to 1$ and $\sup_i |c_i| < \infty$ is
strongly consistent (Theorem 1.13(ii)).


One or a combination of the law of large numbers, the CLT, Slutsky's
theorem (Theorem 1.11), and the continuous mapping theorem (Theorems
1.10 and 1.12) are typically applied to establish consistency of point
estimators. In particular, Theorem 1.10 implies that if $T_n$ is (strongly)
consistent for $\vartheta$ and $g$ is a continuous function of
$\vartheta$, then $g(T_n)$ is (strongly) consistent for $g(\vartheta)$. For
example, in Example 2.33 the point estimator $\bar X^2$ is strongly
consistent for $\mu^2$. To show that $\bar X^2$ is $\sqrt{n}$-consistent
under the assumption that $P$ has a finite variance $\sigma^2$, we can use
the identity

$$\sqrt{n}(\bar X^2 - \mu^2) = \sqrt{n}(\bar X - \mu)(\bar X + \mu)$$

and the fact that $\bar X$ is $\sqrt{n}$-consistent for $\mu$ and
$\bar X + \mu = O_p(1)$. (Note that $\bar X^2$ may not be consistent in mse
since we do not assume that $P$ has a finite fourth moment.)
Alternatively, we can use the fact that
$\sqrt{n}(\bar X^2 - \mu^2) \to_d N(0, 4\mu^2\sigma^2)$ (by the CLT and
Theorem 1.12) to show the $\sqrt{n}$-consistency of $\bar X^2$.
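A small simulation can make the $\sqrt{n}$-consistency of $\bar X^2$ concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be normal, the seed and sample sizes are arbitrary, and the function names are hypothetical. It draws repeated samples, forms $\sqrt{n}(\bar X^2 - \mu^2)$, and lets one compare the empirical variance with the limiting value $4\mu^2\sigma^2$.

```python
import random
from math import sqrt


def scaled_errors(n, reps, mu, sigma, seed=0):
    """Simulate sqrt(n) * (Xbar^2 - mu^2) for `reps` normal samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    out = []
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        out.append(sqrt(n) * (xbar**2 - mu**2))
    return out


def sample_variance(values):
    """Unbiased sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)
```

With $\mu = 2$ and $\sigma = 1$, the empirical variance of `scaled_errors(200, 1000, 2.0, 1.0)` should be close to $4\mu^2\sigma^2 = 16$, up to simulation error.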

The following example shows another way to establish consistency of 
some point estimators. 


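Example 2.34. Let $X_1, \ldots, X_n$ be i.i.d. from an unknown $P$ with a
continuous c.d.f. $F$ satisfying $F(\theta) = 1$ for some
$\theta \in \mathcal{R}$ and $F(x) < 1$ for any $x < \theta$. Consider the
largest order statistic $X_{(n)}$. For any $\epsilon > 0$,
$F(\theta - \epsilon) < 1$ and

$$P(|X_{(n)} - \theta| \ge \epsilon) = P(X_{(n)} \le \theta - \epsilon) = [F(\theta - \epsilon)]^n,$$

which imply (according to Theorem 1.8(v)) $X_{(n)} \to_{a.s.} \theta$,
i.e., $X_{(n)}$ is strongly consistent for $\theta$. If we assume that
$F^{(i)}(\theta-)$, the $i$th-order left-hand derivative of $F$ at
$\theta$, exists and vanishes for any $i \le m$ and that
$F^{(m+1)}(\theta-)$ exists and is nonzero, where $m$ is a nonnegative
integer, then

$$1 - F(X_{(n)}) = \frac{(-1)^m F^{(m+1)}(\theta-)}{(m+1)!}(\theta - X_{(n)})^{m+1} + o\big(|\theta - X_{(n)}|^{m+1}\big) \quad \text{a.s.}$$

This result and the fact that
$P\big(n[1 - F(X_{(n)})] \ge s\big) = (1 - s/n)^n$ imply that
$(\theta - X_{(n)})^{m+1} = O_p(n^{-1})$, i.e., $X_{(n)}$ is
$n^{(m+1)^{-1}}$-consistent. If $m = 0$, then $X_{(n)}$ is $n$-consistent,
which is the most common situation. If $m = 1$, then $X_{(n)}$ is
$\sqrt{n}$-consistent. The limiting distribution of
$n^{(m+1)^{-1}}(X_{(n)} - \theta)$ can be derived as follows. Let

$$h_n(\theta) = \left[\frac{(-1)^m (m+1)!}{nF^{(m+1)}(\theta-)}\right]^{(m+1)^{-1}}.$$

For $t \le 0$, by Slutsky's theorem,

$$\begin{aligned}
\lim_{n\to\infty} P\left(\frac{X_{(n)} - \theta}{h_n(\theta)} \le t\right)
&= \lim_{n\to\infty} P\left(\left[\frac{\theta - X_{(n)}}{h_n(\theta)}\right]^{m+1} \ge (-t)^{m+1}\right) \\
&= \lim_{n\to\infty} P\big(n[1 - F(X_{(n)})] \ge (-t)^{m+1}\big) \\
&= \lim_{n\to\infty} \big[1 - (-t)^{m+1}/n\big]^n \\
&= e^{-(-t)^{m+1}}.
\end{aligned}$$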
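As a concrete check of the case $m = 0$, suppose (as an assumption made only for this illustration) that $F$ is the c.d.f. of the uniform distribution on $(0, \theta)$. Then $F'(\theta-) = 1/\theta$, the normalizing constant is $h_n(\theta) = \theta/n$, and $P\big((X_{(n)} - \theta)/h_n(\theta) \le t\big) = (1 + t/n)^n$ exactly for $-n \le t \le 0$, which converges to $e^{t} = e^{-(-t)^{m+1}}$. The function names below are hypothetical.

```python
from math import exp


def cdf_normalized_max(t, n):
    """Exact P((X_(n) - theta) / (theta / n) <= t) for Uniform(0, theta), -n <= t <= 0."""
    return (1.0 + t / n) ** n


def limit_cdf(t, m=0):
    """Limiting c.d.f. exp(-(-t)^(m+1)) for t <= 0."""
    return exp(-((-t) ** (m + 1)))
```

Evaluating `cdf_normalized_max(t, n)` for increasing `n` and a fixed `t <= 0` shows the exact probabilities approaching `limit_cdf(t)`, independently of the value of $\theta$.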


It can be seen from the previous examples that there are many consistent 
estimators. Like the admissibility in statistical decision theory, consistency 
is a very essential requirement in the sense that any inconsistent estimators 
should not be used, but a consistent estimator is not necessarily good. 
Thus, consistency should be used together with one or a few more criteria. 




We now discuss a situation in which finding a consistent estimator is
crucial. Suppose that an estimator $T_n$ of $\vartheta$ satisfies

$$c_n[T_n(X) - \vartheta] \to_d \sigma Y, \tag{2.35}$$

where $Y$ is a random variable with a known distribution, $\sigma > 0$ is
an unknown parameter, and $\{c_n\}$ is a sequence of constants; for
example, in Example 2.33, $\sqrt{n}(\bar X - \mu) \to_d N(0, \sigma^2)$; in
Example 2.34, (2.35) holds with $c_n = n^{(m+1)^{-1}}$ and
$\sigma = [(-1)^m (m+1)!/F^{(m+1)}(\theta-)]^{(m+1)^{-1}}$. If a consistent
estimator $\hat\sigma_n$ of $\sigma$ can be found, then, by Slutsky's
theorem,

$$c_n[T_n(X) - \vartheta]/\hat\sigma_n \to_d Y$$

and, thus, we may approximate the distribution of
$c_n[T_n(X) - \vartheta]/\hat\sigma_n$ by the known distribution of $Y$.


2.5.2 Asymptotic bias, variance, and mse 


Unbiasedness as a criterion for point estimators is discussed in §2.3.2 and
§2.4.1. In some cases, however, there is no unbiased estimator (Exercise 84
in §2.6). Furthermore, having a "slight" bias in some cases may not be a
bad idea (see Exercise 63 in §2.6). Let $T_n(X)$ be a point estimator of
$\vartheta$ for every $n$. If $ET_n$ exists for every $n$ and
$\lim_{n\to\infty} E(T_n - \vartheta) = 0$ for any $P \in \mathcal{P}$,
then $T_n$ is said to be approximately unbiased.


There are many reasonable point estimators whose expectations are
not well defined. For example, consider i.i.d.
$(X_1, Y_1), \ldots, (X_n, Y_n)$ from a bivariate normal distribution with
$\mu_x = EX_1$ and $\mu_y = EY_1 \ne 0$. Let $\vartheta = \mu_x/\mu_y$ and
$T_n = \bar X/\bar Y$, the ratio of two sample means. Then $ET_n$ is not
defined for any $n$. It is then desirable to define a concept of
asymptotic bias for point estimators whose expectations are not well
defined.


Definition 2.11. (i) Let $\xi, \xi_1, \xi_2, \ldots$ be random variables
and $\{a_n\}$ be a sequence of positive numbers satisfying
$a_n \to \infty$ or $a_n \to a > 0$. If $a_n\xi_n \to_d \xi$ and
$E|\xi| < \infty$, then $E\xi/a_n$ is called an asymptotic expectation of
$\xi_n$.
(ii) Let $T_n$ be a point estimator of $\vartheta$ for every $n$. An
asymptotic expectation of $T_n - \vartheta$, if it exists, is called an
asymptotic bias of $T_n$ and denoted by $\tilde b_{T_n}(P)$ (or
$\tilde b_{T_n}(\theta)$ if $P$ is in a parametric family). If
$\lim_{n\to\infty} \tilde b_{T_n}(P) = 0$ for any $P \in \mathcal{P}$, then
$T_n$ is said to be asymptotically unbiased.


Like the consistency, the asymptotic expectation (or bias) is a concept
relating to sequences $\{\xi_n\}$ and $\{E\xi/a_n\}$ (or $\{T_n\}$ and
$\{\tilde b_{T_n}(P)\}$). Note that the exact bias $b_{T_n}(P)$ is not
necessarily the same as the asymptotic bias $\tilde b_{T_n}(P)$ when both
of them exist (Exercise 115 in §2.6). The following result shows that the
asymptotic expectation defined in Definition 2.11 is essentially unique.




Proposition 2.3. Let $\{\xi_n\}$ be a sequence of random variables. Suppose
that both $E\xi/a_n$ and $E\eta/b_n$ are asymptotic expectations of
$\xi_n$ defined according to Definition 2.11(i). Then, one of the following
three must hold:
(a) $E\xi = E\eta = 0$;
(b) $E\xi \ne 0$, $E\eta = 0$, and $b_n/a_n \to 0$; or $E\xi = 0$,
$E\eta \ne 0$, and $a_n/b_n \to 0$;
(c) $E\xi \ne 0$, $E\eta \ne 0$, and $(E\xi/a_n)/(E\eta/b_n) \to 1$.

Proof. According to Definition 2.11(i), $a_n\xi_n \to_d \xi$ and
$b_n\xi_n \to_d \eta$.
(i) If both $\xi$ and $\eta$ have nondegenerate c.d.f.'s, then the result
follows from Exercise 129 of §1.6.
(ii) Suppose that $\xi$ has a nondegenerate c.d.f. but $\eta$ is a
constant. If $\eta \ne 0$, then by Theorem 1.11(iii),
$a_n/b_n \to \xi/\eta$, which is impossible since $\xi$ has a nondegenerate
c.d.f. If $\eta = 0$, then by Theorem 1.11(ii), $b_n/a_n \to 0$.
(iii) Suppose that both $\xi$ and $\eta$ are constants. If
$\xi = \eta = 0$, the result follows. If $\xi \ne 0$ and $\eta = 0$, then
$b_n/a_n \to 0$. If $\xi \ne 0$ and $\eta \ne 0$, then
$b_n/a_n \to \eta/\xi$. ∎


If $T_n$ is a consistent estimator of $\vartheta$, then
$T_n = \vartheta + o_p(1)$ and, by Definition 2.11(ii), $T_n$ is
asymptotically unbiased, although $T_n$ may not be approximately unbiased;
in fact, $g(T_n)$ is asymptotically unbiased for $g(\vartheta)$ for any
continuous function $g$. For the example of $T_n = \bar X/\bar Y$,
$T_n \to_{a.s.} \mu_x/\mu_y$ by the SLLN and Theorem 1.10. Hence $T_n$ is
asymptotically unbiased, although $ET_n$ may not be defined. In Example
2.34, $X_{(n)}$ has the asymptotic bias
$\tilde b_{X_{(n)}}(P) = h_n(\theta)EY$, which is of order
$n^{-(m+1)^{-1}}$.


When $a_n(T_n - \vartheta) \to_d Y$ with $EY = 0$ (e.g., $T_n = \bar X^2$
and $\vartheta = \mu^2$ in Example 2.33), a more precise order of the
asymptotic bias of $T_n$ may be obtained (for comparing different
estimators in terms of their asymptotic biases). Suppose that there is a
sequence of random variables $\{\eta_n\}$ such that

$$a_n\eta_n \to_d Y \quad \text{and} \quad a_n^2(T_n - \vartheta - \eta_n) \to_d W, \tag{2.36}$$

where $Y$ and $W$ are random variables with finite means, $EY = 0$ and
$EW \ne 0$. Then we may define $a_n^{-2}$ to be the order of
$\tilde b_{T_n}(P)$ or define $EW/a_n^2$ to be the $a_n^{-2}$ order
asymptotic bias of $T_n$. However, $\eta_n$ in (2.36) may not be unique.
Some regularity conditions have to be imposed so that the order of
asymptotic bias of $T_n$ can be uniquely defined. In the following we focus
on the case where $X_1, \ldots, X_n$ are i.i.d. random $k$-vectors. Suppose
that $T_n$ has the following expansion:

$$T_n - \vartheta = \frac{1}{n}\sum_{i=1}^n \phi(X_i) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \psi(X_i, X_j) + o_p\left(\frac{1}{n}\right), \tag{2.37}$$

where $\phi$ and $\psi$ are functions that may depend on $P$,
$E\phi(X_1) = 0$, $E[\phi(X_1)]^2 < \infty$, $\psi(x,y) = \psi(y,x)$,
$E\psi(x, X_1) = 0$ for all $x$, $E[\psi(X_i, X_j)]^2 < \infty$,
$i \le j$, and $E\psi(X_1, X_1) \ne 0$. From the result for V-statistics in
§3.5.3 (Theorem 3.16 and Exercise 113 in §3.6),

$$\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n \psi(X_i, X_j) \to_d W,$$

where $W$ is a random variable with $EW = E\psi(X_1, X_1)$. Hence (2.36)
holds with $a_n = \sqrt{n}$ and $\eta_n = n^{-1}\sum_{i=1}^n \phi(X_i)$.
Consequently, we can define $E\psi(X_1, X_1)/n$ to be the $n^{-1}$ order
asymptotic bias of $T_n$. Examples of estimators that have expansion (2.37)
are provided in §3.5.3 and §5.2.1. In the following we consider the special
case of functions of sample means.

Let $X_1, \ldots, X_n$ be i.i.d. random $k$-vectors with finite
$\Sigma = \mathrm{Var}(X_1)$, $\bar X = n^{-1}\sum_{i=1}^n X_i$, and
$T_n = g(\bar X)$, where $g$ is a function on $\mathcal{R}^k$ that is
second-order differentiable at $\mu = EX_1 \in \mathcal{R}^k$. Consider
$T_n$ as an estimator of $\vartheta = g(\mu)$. Using Taylor's expansion, we
obtain expansion (2.37) with $\phi(x) = [\nabla g(\mu)]^\tau(x - \mu)$ and
$\psi(x, y) = (x - \mu)^\tau \nabla^2 g(\mu)(y - \mu)/2$, where
$\nabla g$ is the $k$-vector of partial derivatives of $g$ and
$\nabla^2 g$ is the $k \times k$ matrix of second-order partial
derivatives of $g$. By the CLT and Theorem 1.10(iii),

$$\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n \psi(X_i, X_j) = \frac{n}{2}(\bar X - \mu)^\tau \nabla^2 g(\mu)(\bar X - \mu) \to_d \frac{Z_\Sigma^\tau \nabla^2 g(\mu) Z_\Sigma}{2},$$

where $Z_\Sigma = N_k(0, \Sigma)$. Thus,

$$\frac{E[Z_\Sigma^\tau \nabla^2 g(\mu) Z_\Sigma]}{2n} = \frac{\mathrm{tr}\big(\nabla^2 g(\mu)\Sigma\big)}{2n} \tag{2.38}$$

is the $n^{-1}$ order asymptotic bias of $T_n = g(\bar X)$, where
$\mathrm{tr}(A)$ denotes the trace of the matrix $A$. Note that the
quantity in (2.38) is the same as the leading term in the exact bias of
$T_n = g(\bar X)$ obtained under a much more stringent condition on the
derivatives of $g$ (Lehmann, 1983, Theorem 2.5.1).


Example 2.35. Let $X_1, \ldots, X_n$ be i.i.d. binary random variables with
$P(X_i = 1) = p$, where $p \in (0,1)$ is unknown. Consider first the
estimation of $\vartheta = p(1-p)$. Since
$\mathrm{Var}(\bar X) = p(1-p)/n$, the $n^{-1}$ order asymptotic bias of
$T_n = \bar X(1 - \bar X)$ according to (2.38) with $g(x) = x(1-x)$ is
$-p(1-p)/n$. On the other hand, a direct computation shows
$E[\bar X(1 - \bar X)] = E\bar X - E\bar X^2 = p - (E\bar X)^2 - \mathrm{Var}(\bar X) = p(1-p) - p(1-p)/n$.
Hence, the exact bias of $T_n$ is the same as the $n^{-1}$ order asymptotic
bias.

Consider next the estimation of $\vartheta = p^{-1}$. In this case, there
is no unbiased estimator of $p^{-1}$ (Exercise 84 in §2.6). Let
$T_n = \bar X^{-1}$. Then, an $n^{-1}$ order asymptotic bias of $T_n$
according to (2.38) with $g(x) = x^{-1}$ is $(1-p)/(p^2 n)$. On the other
hand, $ET_n = \infty$ for every $n$.
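Both computations in Example 2.35 can be reproduced with exact rational arithmetic. The sketch below is an illustration only: the function names are hypothetical, the exact bias is obtained by summing over the binomial distribution of $n\bar X$, and `asymptotic_bias` is formula (2.38) specialized to $k = 1$, where it reduces to $g''(\mu)\mathrm{Var}(X_1)/(2n)$.

```python
from fractions import Fraction
from math import comb


def exact_bias_of_variance_estimator(n, p):
    """Exact E[Xbar(1 - Xbar)] - p(1 - p) for i.i.d. binary X_i with P(X_i = 1) = p."""
    total = Fraction(0)
    for k in range(n + 1):
        xbar = Fraction(k, n)  # value of Xbar when k of the n observations equal 1
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        total += prob * xbar * (1 - xbar)
    return total - p * (1 - p)


def asymptotic_bias(second_derivative, variance, n):
    """Formula (2.38) for k = 1: g''(mu) * Var(X_1) / (2n)."""
    return second_derivative * variance / (2 * n)
```

For $g(x) = x(1-x)$ the second derivative is $-2$, and for $g(x) = x^{-1}$ it is $2/p^3$ at $x = p$; passing these values with `variance = p * (1 - p)` gives $-p(1-p)/n$ and $(1-p)/(p^2 n)$, respectively, when `p` is supplied as a `Fraction`.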


Like the bias, the mse of an estimator $T_n$ of $\vartheta$,
$\mathrm{mse}_{T_n}(P) = E(T_n - \vartheta)^2$, is not well defined if the
second moment of $T_n$ does not exist. We now define a version of
asymptotic mean squared error (amse) and a measure of assessing different
point estimators of a common parameter.


Definition 2.12. Let $T_n$ be an estimator of $\vartheta$ for every $n$ and
$\{a_n\}$ be a sequence of positive numbers satisfying $a_n \to \infty$ or
$a_n \to a > 0$. Assume that $a_n(T_n - \vartheta) \to_d Y$ with
$0 < EY^2 < \infty$.
(i) The asymptotic mean squared error of $T_n$, denoted by
$\mathrm{amse}_{T_n}(P)$ or $\mathrm{amse}_{T_n}(\theta)$ if $P$ is in a
parametric family indexed by $\theta$, is defined to be the asymptotic
expectation of $(T_n - \vartheta)^2$, i.e.,
$\mathrm{amse}_{T_n}(P) = EY^2/a_n^2$. The asymptotic variance of $T_n$ is
defined to be $\sigma^2_{T_n}(P) = \mathrm{Var}(Y)/a_n^2$.
(ii) Let $T_n'$ be another estimator of $\vartheta$. The asymptotic
relative efficiency of $T_n'$ w.r.t. $T_n$ is defined to be
$e_{T_n',T_n}(P) = \mathrm{amse}_{T_n}(P)/\mathrm{amse}_{T_n'}(P)$.
(iii) $T_n$ is said to be asymptotically more efficient than $T_n'$ if and
only if $\limsup_n e_{T_n',T_n}(P) \le 1$ for any $P$ and $< 1$ for some
$P$. ∎


The amse and asymptotic variance are the same if and only if $EY = 0$.
By Proposition 2.3, the amse or the asymptotic variance of $T_n$ is
essentially unique and, therefore, the concept of asymptotic relative
efficiency in Definition 2.12(ii)-(iii) is well defined.

In Example 2.33,
$\mathrm{amse}_{\bar X^2}(P) = \sigma^2_{\bar X^2}(P) = 4\mu^2\sigma^2/n$.
In Example 2.34,
$\sigma^2_{X_{(n)}}(P) = [h_n(\theta)]^2\mathrm{Var}(Y)$ and
$\mathrm{amse}_{X_{(n)}}(P) = [h_n(\theta)]^2 EY^2$.


When both $\mathrm{mse}_{T_n}(P)$ and $\mathrm{mse}_{T_n'}(P)$ exist, one
may compare $T_n$ and $T_n'$ by evaluating the relative efficiency
$\mathrm{mse}_{T_n}(P)/\mathrm{mse}_{T_n'}(P)$. However, this comparison
may be different from the one using the asymptotic relative efficiency in
Definition 2.12(ii), since the mse and amse of an estimator may be
different (Exercise 115 in §2.6). The following result shows that when the
exact mse of $T_n$ exists, it is no smaller than the amse of $T_n$. It also
provides a condition under which the exact mse and the amse are the same.


Proposition 2.4. Let $T_n$ be an estimator of $\vartheta$ for every $n$ and
$\{a_n\}$ be a sequence of positive numbers satisfying $a_n \to \infty$ or
$a_n \to a > 0$. Suppose that $a_n(T_n - \vartheta) \to_d Y$ with
$0 < EY^2 < \infty$. Then
(i) $EY^2 \le \liminf_n E[a_n^2(T_n - \vartheta)^2]$ and
(ii) $EY^2 = \lim_{n\to\infty} E[a_n^2(T_n - \vartheta)^2]$ if and only if
$\{a_n^2(T_n - \vartheta)^2\}$ is uniformly integrable.

Proof. (i) By Theorem 1.10(iii),

$$\min\{a_n^2(T_n - \vartheta)^2, t\} \to_d \min\{Y^2, t\}$$

for any $t > 0$. Since $\min\{a_n^2(T_n - \vartheta)^2, t\}$ is bounded by
$t$,

$$\lim_{n\to\infty} E\big(\min\{a_n^2(T_n - \vartheta)^2, t\}\big) = E\big(\min\{Y^2, t\}\big)$$

(Theorem 1.8(viii)). Then

$$\begin{aligned}
EY^2 &= \lim_{t\to\infty} E\big(\min\{Y^2, t\}\big) \\
&= \lim_{t\to\infty}\lim_{n\to\infty} E\big(\min\{a_n^2(T_n - \vartheta)^2, t\}\big) \\
&= \liminf_{t,n} E\big(\min\{a_n^2(T_n - \vartheta)^2, t\}\big) \\
&\le \liminf_n E[a_n^2(T_n - \vartheta)^2],
\end{aligned}$$

where the third equality follows from the fact that
$E(\min\{a_n^2(T_n - \vartheta)^2, t\})$ is nondecreasing in $t$ for any
fixed $n$.
(ii) The result follows from Theorem 1.8(viii). ∎


Example 2.36. Let $X_1, \ldots, X_n$ be i.i.d. from the Poisson
distribution $P(\theta)$ with an unknown $\theta > 0$. Consider the
estimation of $\vartheta = P(X_i = 0) = e^{-\theta}$. Let
$T_{1n} = F_n(0)$, where $F_n$ is the empirical c.d.f. defined in (2.28).
Then $T_{1n}$ is unbiased and has
$\mathrm{mse}_{T_{1n}}(\theta) = e^{-\theta}(1 - e^{-\theta})/n$. Also,
$\sqrt{n}(T_{1n} - \vartheta) \to_d N(0, e^{-\theta}(1 - e^{-\theta}))$ by
the CLT. Thus, in this case
$\mathrm{amse}_{T_{1n}}(\theta) = \mathrm{mse}_{T_{1n}}(\theta)$.

Next, consider $T_{2n} = e^{-\bar X}$. Note that
$ET_{2n} = e^{n\theta(e^{-1/n} - 1)}$. Hence
$n b_{T_{2n}}(\theta) \to \theta e^{-\theta}/2$. Using Theorem 1.12 and the
CLT, we can show that
$\sqrt{n}(T_{2n} - \vartheta) \to_d N(0, e^{-2\theta}\theta)$. By
Definition 2.12(i),
$\mathrm{amse}_{T_{2n}}(\theta) = e^{-2\theta}\theta/n$. Thus, the
asymptotic relative efficiency of $T_{1n}$ w.r.t. $T_{2n}$ is

$$e_{T_{1n},T_{2n}}(\theta) = \theta/(e^\theta - 1),$$

which is always less than 1. This shows that $T_{2n}$ is asymptotically
more efficient than $T_{1n}$. ∎
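The quantities in Example 2.36 are easy to evaluate numerically. The following sketch, with hypothetical function names, computes the exact mean $e^{n\theta(e^{-1/n} - 1)}$ of $e^{-\bar X}$ and the ratio of the two amse's, so that the limit $\theta e^{-\theta}/2$ of $n$ times the bias and the closed form $\theta/(e^\theta - 1)$ can be compared with direct evaluation.

```python
from math import exp, expm1


def exact_mean_T2(n, theta):
    """E[exp(-Xbar)] = exp(n * theta * (exp(-1/n) - 1)) for i.i.d. Poisson(theta) data."""
    return exp(n * theta * expm1(-1.0 / n))


def n_times_bias_T2(n, theta):
    """n * (E[T_2n] - exp(-theta)), which tends to theta * exp(-theta) / 2."""
    return n * (exact_mean_T2(n, theta) - exp(-theta))


def are_T1_wrt_T2(theta):
    """amse of T_2n divided by amse of T_1n; the common factor 1/n cancels."""
    amse_t1 = exp(-theta) * (1.0 - exp(-theta))
    amse_t2 = exp(-2.0 * theta) * theta
    return amse_t2 / amse_t1
```

For any `theta > 0`, `are_T1_wrt_T2(theta)` stays below 1 and decreases as `theta` grows, in line with the conclusion of the example.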


The result for $T_{2n}$ in Example 2.36 is a special case (with
$U_n = \bar X$) of the following general result.

Theorem 2.6. Let $g$ be a function on $\mathcal{R}^k$ that is
differentiable at $\theta \in \mathcal{R}^k$ and let $U_n$ be a $k$-vector
of statistics satisfying $a_n(U_n - \theta) \to_d Y$ for a random
$k$-vector $Y$ with $0 < E\|Y\|^2 < \infty$ and a sequence of positive
numbers $\{a_n\}$ satisfying $a_n \to \infty$. Let $T_n = g(U_n)$ be an
estimator of $\vartheta = g(\theta)$. Then, the amse and asymptotic
variance of $T_n$ are, respectively,
$E\{[\nabla g(\theta)]^\tau Y\}^2/a_n^2$ and
$[\nabla g(\theta)]^\tau \mathrm{Var}(Y)\nabla g(\theta)/a_n^2$.
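The two expressions in Theorem 2.6 are quadratic forms in the gradient of $g$, since $E\{[\nabla g(\theta)]^\tau Y\}^2 = [\nabla g(\theta)]^\tau E(YY^\tau)\nabla g(\theta)$. A minimal sketch of this computation follows; the function name is hypothetical, and the gradient and the matrix ($E(YY^\tau)$ for the amse, or $\mathrm{Var}(Y)$ for the asymptotic variance) are assumed to be supplied by the user as plain Python lists.

```python
def delta_method_quadratic_form(grad, matrix, a_n):
    """Return grad^T * matrix * grad / a_n^2.

    Pass E[Y Y^T] as `matrix` to obtain the amse of g(U_n), or Var(Y)
    to obtain its asymptotic variance, as in Theorem 2.6.
    """
    k = len(grad)
    quad = sum(grad[i] * matrix[i][j] * grad[j] for i in range(k) for j in range(k))
    return quad / a_n**2
```

For the estimator $e^{-\bar X}$ of Example 2.36, `grad = [-exp(-theta)]`, `matrix = [[theta]]`, and `a_n = sqrt(n)` give $e^{-2\theta}\theta/n$.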


2.5.3 Asymptotic inference 


Statistical inference based on asymptotic criteria and approximations is 
called asymptotic statistical inference or simply asymptotic inference. We 
have previously considered asymptotic estimation. We now focus on asymp- 
totic hypothesis tests and confidence sets. 




Definition 2.13. Let $X = (X_1, \ldots, X_n)$ be a sample from
$P \in \mathcal{P}$ and $T_n(X)$ be a test for
$H_0 : P \in \mathcal{P}_0$ versus $H_1 : P \in \mathcal{P}_1$.
(i) If $\limsup_n \alpha_{T_n}(P) \le \alpha$ for any
$P \in \mathcal{P}_0$, then $\alpha$ is an asymptotic significance level of
$T_n$.
(ii) If $\lim_{n\to\infty} \sup_{P \in \mathcal{P}_0} \alpha_{T_n}(P)$
exists, then it is called the limiting size of $T_n$.
(iii) $T_n$ is called consistent if and only if the type II error
probability converges to 0, i.e.,
$\lim_{n\to\infty}[1 - \alpha_{T_n}(P)] = 0$, for any
$P \in \mathcal{P}_1$.
(iv) $T_n$ is called Chernoff-consistent if and only if $T_n$ is consistent
and the type I error probability converges to 0, i.e.,
$\lim_{n\to\infty} \alpha_{T_n}(P) = 0$, for any $P \in \mathcal{P}_0$.
$T_n$ is called strongly Chernoff-consistent if and only if $T_n$ is
consistent and the limiting size of $T_n$ is 0. ∎


Obviously if $T_n$ has size (or significance level) $\alpha$ for all $n$,
then its limiting size (or asymptotic significance level) is $\alpha$. If
the limiting size of $T_n$ is $\alpha \in (0,1)$, then for any
$\epsilon > 0$, the size of $T_n$ is at most $\alpha + \epsilon$ for all
$n \ge n_0$, where $n_0$ is independent of $P$. Hence $T_n$ has level of
significance $\alpha + \epsilon$ for any $n \ge n_0$. However, if
$\mathcal{P}_0$ is not a parametric family, it is likely that the limiting
size of $T_n$ is 1 (see, e.g., Example 2.37). This is the reason why we
consider the weaker requirement in Definition 2.13(i). If $T_n$ has
asymptotic significance level $\alpha$, then for any $\epsilon > 0$,
$\alpha_{T_n}(P) < \alpha + \epsilon$ for all $n \ge n_0(P)$, but $n_0(P)$
depends on $P \in \mathcal{P}_0$, and there is no guarantee that $T_n$ has
significance level $\alpha + \epsilon$ for any $n$.


The consistency in Definition 2.13(iii) only requires that the type II
error probability converge to 0. We may define uniform consistency to be
$\lim_{n\to\infty} \sup_{P \in \mathcal{P}_1}[1 - \alpha_{T_n}(P)] = 0$,
but it is not satisfied in most problems. If $\alpha \in (0,1)$ is a
pre-assigned level of significance for the problem, then a consistent test
$T_n$ having asymptotic significance level $\alpha$ is called
asymptotically correct, and a consistent test having limiting size
$\alpha$ is called strongly asymptotically correct.


The Chernoff-consistency (or strong Chernoff-consistency) in Definition
2.13(iv) requires that both types of error probabilities converge to 0.
Mathematically, Chernoff-consistency (or strong Chernoff-consistency) is
better than asymptotic correctness (or strongly asymptotic correctness).
After all, both types of error probabilities should decrease to 0 if
sampling can be continued indefinitely. However, if $\alpha$ is chosen to
be small enough so that error probabilities smaller than $\alpha$ can be
practically treated as 0, then the asymptotic correctness (or strongly
asymptotic correctness) is enough, and is probably preferred, since
requiring an unnecessarily small type I error probability usually results
in an unnecessary increase in the type II error probability, as the
following example illustrates.


Example 2.37. Consider the testing problem $H_0 : \mu \le \mu_0$ versus
$H_1 : \mu > \mu_0$ based on i.i.d. $X_1, \ldots, X_n$ with
$EX_1 = \mu \in \mathcal{R}$. If each $X_i$ has the $N(\mu, \sigma^2)$
distribution with a known $\sigma^2$, then the test $T_{c_\alpha}$ given in
Example 2.28 with $c_\alpha = \sigma z_{1-\alpha}/\sqrt{n} + \mu_0$ and
$\alpha \in (0,1)$ has size $\alpha$ (and, therefore, limiting size
$\alpha$). It also follows from (2.32) that for any $\mu > \mu_0$,

$$1 - \alpha_{T_{c_\alpha}}(\mu) = \Phi\left(z_{1-\alpha} + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right) \to 0 \tag{2.39}$$

as $n \to \infty$. This shows that $T_{c_\alpha}$ is consistent and, hence,
is strongly asymptotically correct. Note that the convergence in (2.39) is
not uniform in $\mu > \mu_0$, but is uniform in $\mu > \mu_1$ for any fixed
$\mu_1 > \mu_0$.

Since the size of $T_{c_\alpha}$ is $\alpha$ for all $n$, $T_{c_\alpha}$ is
not Chernoff-consistent. A strongly Chernoff-consistent test can be
obtained as follows. Let

$$\alpha_n = 1 - \Phi(\sqrt{n}a_n), \tag{2.40}$$

where $a_n$'s are positive numbers satisfying $a_n \to 0$ and
$\sqrt{n}a_n \to \infty$. Let $T_n$ be $T_{c_\alpha}$ with
$\alpha = \alpha_n$ for each $n$. Then, $T_n$ has size $\alpha_n$. Since
$\alpha_n \to 0$, the limiting size of $T_n$ is 0. On the other hand,
(2.39) still holds with $\alpha$ replaced by $\alpha_n$. This follows from
the fact that

$$z_{1-\alpha_n} + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} = \sqrt{n}\left(a_n + \frac{\mu_0 - \mu}{\sigma}\right) \to -\infty$$

for any $\mu > \mu_0$. Hence $T_n$ is strongly Chernoff-consistent.
However, if $\alpha_n < \alpha$, then, from the left-hand side of (2.39),
$1 - \alpha_{T_{c_\alpha}}(\mu) < 1 - \alpha_{T_n}(\mu)$ for any
$\mu > \mu_0$.

We now consider the case where the population $P$ is not in a parametric
family. We still assume that $\sigma^2 = \mathrm{Var}(X_i)$ is known. Using
the CLT, we can show that for $\mu > \mu_0$,

$$\lim_{n\to\infty}[1 - \alpha_{T_{c_\alpha}}(\mu)] = \lim_{n\to\infty} \Phi\left(z_{1-\alpha} + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right) = 0,$$

i.e., $T_{c_\alpha}$ is still consistent. For $\mu \le \mu_0$,

$$\lim_{n\to\infty} \alpha_{T_{c_\alpha}}(\mu) = 1 - \lim_{n\to\infty} \Phi\left(z_{1-\alpha} + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right),$$

which equals $\alpha$ if $\mu = \mu_0$ and 0 if $\mu < \mu_0$. Thus, the
asymptotic significance level of $T_{c_\alpha}$ is $\alpha$. Combining
these two results, we know that $T_{c_\alpha}$ is asymptotically correct.
However, if $\mathcal{P}$ contains all possible populations on
$\mathcal{R}$ with finite second moments, then one can show that the
limiting size of $T_{c_\alpha}$ is 1 (exercise). For $\alpha_n$ defined by
(2.40), we can show that $T_n = T_{c_\alpha}$ with $\alpha = \alpha_n$ is
Chernoff-consistent (exercise). But $T_n$ is not strongly
Chernoff-consistent if $\mathcal{P}$ contains all possible populations on
$\mathcal{R}$ with finite second moments. ∎
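The trade-off described in Example 2.37 can be seen numerically in the normal case. The sketch below uses hypothetical function names and, as an assumption made only for illustration, the sequence $a_n = n^{-1/4}$, which satisfies $a_n \to 0$ and $\sqrt{n}a_n \to \infty$. The type II error of the shrinking-level test is evaluated through $z_{1-\alpha_n} = \sqrt{n}a_n$, which avoids inverting the normal c.d.f. at values numerically equal to 1.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type2_fixed_level(n, mu, mu0, sigma, alpha):
    """Type II error Phi(z_{1-alpha} + sqrt(n)(mu0 - mu)/sigma) from (2.39)."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(z + sqrt(n) * (mu0 - mu) / sigma)


def shrinking_level(n):
    """alpha_n = 1 - Phi(sqrt(n) a_n) from (2.40) with the assumed a_n = n**(-1/4)."""
    a_n = n ** (-0.25)
    return STD_NORMAL.cdf(-sqrt(n) * a_n)  # equals 1 - Phi(sqrt(n) a_n) by symmetry


def type2_shrinking_level(n, mu, mu0, sigma):
    """Type II error of T_n, using z_{1 - alpha_n} = sqrt(n) a_n."""
    a_n = n ** (-0.25)
    return STD_NORMAL.cdf(sqrt(n) * (a_n + (mu0 - mu) / sigma))
```

For a fixed alternative $\mu > \mu_0$, both type II error probabilities tend to 0 as `n` grows, but whenever `shrinking_level(n)` is below the fixed $\alpha$, the shrinking-level test has the larger type II error.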




Definition 2.14. Let $X = (X_1, \ldots, X_n)$ be a sample from
$P \in \mathcal{P}$, $\vartheta$ be a $k$-vector of parameters related to
$P$, and $C(X)$ be a confidence set for $\vartheta$.
(i) If $\liminf_n P(\vartheta \in C(X)) \ge 1 - \alpha$ for any
$P \in \mathcal{P}$, then $1 - \alpha$ is an asymptotic significance level
of $C(X)$.
(ii) If $\lim_{n\to\infty} \inf_{P \in \mathcal{P}} P(\vartheta \in C(X))$
exists, then it is called the limiting confidence coefficient of $C(X)$. ∎


Note that the asymptotic significance level and limiting confidence
coefficient of a confidence set are very similar to the asymptotic
significance level and limiting size of a test, respectively. Some
conclusions are also similar. For example, in a parametric problem one can
often find a confidence set having limiting confidence coefficient
$1 - \alpha \in (0,1)$, which implies that for any $\epsilon > 0$, the
confidence coefficient of $C(X)$ is at least $1 - \alpha - \epsilon$ for
all $n \ge n_0$, where $n_0$ is independent of $P$; in a nonparametric
problem the limiting confidence coefficient of $C(X)$ might be 0, whereas
$C(X)$ may have asymptotic significance level $1 - \alpha \in (0,1)$, but
for any fixed $n$, the confidence coefficient of $C(X)$ might be 0.


The confidence interval in Example 2.31 with
$c = \sigma z_{1-\alpha/2}/\sqrt{n}$ and the confidence set in Example 2.32
have confidence coefficient $1 - \alpha$ for any $n$ and, therefore, have
limiting confidence coefficient $1 - \alpha$. If we drop the normality
assumption and assume $EX_1^4 < \infty$, then these confidence sets have
asymptotic significance level $1 - \alpha$; their limiting confidence
coefficients may be 0 (exercise).


2.6 Exercises 


1. Consider Example 2.3. Suppose that $p(s)$ is constant. Show that $X_i$
and $X_j$, $i \ne j$, are not uncorrelated and, hence,
$X_1, \ldots, X_n$ are not independent. Furthermore, when $y_i$'s are
either 0 or 1, show that $Z = \sum_{i=1}^n X_i$ has a hypergeometric
distribution and compute the mean of $Z$.

2. Consider Example 2.3. Suppose that we do not require that the elements
in $s$ be distinct, i.e., we consider sampling with replacement. Define a
probability measure $p$ and a sample $(X_1, \ldots, X_n)$ such that (2.3)
holds. If $p(s)$ is constant, are $X_1, \ldots, X_n$ independent? If
$p(s)$ is constant and $y_i$'s are either 0 or 1, what are the
distribution and mean of $Z = \sum_{i=1}^n X_i$?


3. Show that {P_θ : θ ∈ Θ} is an exponential family and find its canonical
form and natural parameter space, when
(a) P_θ is the Poisson distribution P(θ), θ ∈ Θ = (0,∞);
(b) P_θ is the negative binomial distribution NB(θ,r) with a fixed r,
θ ∈ Θ = (0,1);
(c) P_θ is the exponential distribution E(a,θ) with a fixed a,
θ ∈ Θ = (0,∞);
(d) P_θ is the gamma distribution Γ(α,γ), θ = (α,γ) ∈ Θ = (0,∞) × (0,∞);
(e) P_θ is the beta distribution B(α,β), θ = (α,β) ∈ Θ = (0,1) × (0,1);
(f) P_θ is the Weibull distribution W(α,θ) with a fixed α > 0,
θ ∈ Θ = (0,∞).

4. Show that the family of exponential distributions E(a,θ) with two
unknown parameters a and θ is not an exponential family.

5. Show that the family of negative binomial distributions NB(p,r) with
two unknown parameters p and r is not an exponential family.

6. Show that the family of Cauchy distributions C(μ,σ) with two unknown
parameters μ and σ is not an exponential family.

7. Show that the family of Weibull distributions W(α,θ) with two unknown
parameters α and θ is not an exponential family.

8. Is the family of log-normal distributions LN(μ,σ²) with two unknown
parameters μ and σ² an exponential family?

9. Show that the family of double exponential distributions DE(μ,θ)
with two unknown parameters μ and θ is not an exponential family,
but the family of double exponential distributions DE(μ,θ) with a
fixed μ and an unknown parameter θ is an exponential family.

10. Show that the k-dimensional normal family discussed in Example 2.4
is an exponential family. Identify the functions T, η, ξ, and h.

11. Obtain the variance-covariance matrix for (X_1,...,X_k) in Example
2.7, using (a) Theorem 2.1(ii) and (b) direct computation.

12. Show that the m.g.f. of the gamma distribution Γ(α,γ) is (1 − γt)^{−α},
t < γ^{−1}, using Theorem 2.1(ii).

13. A discrete random variable X with

P(X = x) = γ(x)θ^x / c(θ), x = 0,1,2,...,

where γ(x) ≥ 0, θ > 0, and c(θ) = Σ_{x=0}^∞ γ(x)θ^x, is called a random
variable with a power series distribution.
(a) Show that {γ(x)θ^x / c(θ) : θ > 0} is an exponential family.
(b) Suppose that X_1,...,X_n are i.i.d. with a power series distribution
γ(x)θ^x / c(θ). Show that Σ_{i=1}^n X_i has the power series distribution
γ_n(x)θ^x / [c(θ)]^n, where γ_n(x) is the coefficient of θ^x in the power
series expansion of [c(θ)]^n.


14. Let X be a random variable with a p.d.f. f_θ in an exponential family
{P_θ : θ ∈ Θ} and let A be a Borel set. Show that the distribution
of X truncated on A (i.e., the conditional distribution of X given
X ∈ A) has a p.d.f. f_θ I_A / P_θ(A) that is in an exponential family.

15. Let {P_(μ,Σ) : μ ∈ R^k, Σ ∈ M_k} be a location-scale family on R^k.
Suppose that P_(0,I_k) has a Lebesgue p.d.f. that is always positive and
that the mean and variance-covariance matrix of P_(0,I_k) are 0 and I_k,
respectively. Show that the mean and variance-covariance matrix of
P_(μ,Σ) are μ and Σ, respectively.

16. Show that if the distribution of a positive random variable X is in a
scale family, then the distribution of log X is in a location family.

17. Let X be a random variable having the gamma distribution Γ(α,γ)
with a known α and an unknown γ > 0 and let Y = σ log X.
(a) Show that if σ > 0 is unknown, then the distribution of Y is in a
location-scale family.
(b) Show that if σ > 0 is known, then the distribution of Y is in an
exponential family.

18. Let X_1,...,X_n be i.i.d. random variables having a finite E|X_1|^4 and
let X̄ and S² be the sample mean and variance defined by (2.1) and
(2.2). Express E(X̄³), Cov(X̄, S²), and Var(S²) in terms of μ_k =
EX_1^k, k = 1,2,3,4. Find a condition under which X̄ and S² are
uncorrelated.

19. Let X_1,...,X_n be i.i.d. random variables having the gamma distribution
Γ(α,γ_x) and Y_1,...,Y_n be i.i.d. random variables having the
gamma distribution Γ(α,γ_y), where α > 0, γ_x > 0, and γ_y > 0. Assume
that X_i's and Y_i's are independent. Derive the distribution of
the statistic X̄/Ȳ, where X̄ and Ȳ are the sample means based on
X_i's and Y_i's, respectively.

20. Let X_1,...,X_n be i.i.d. random variables having the exponential
distribution E(a,θ), a ∈ R, and θ > 0. Show that the smallest order
statistic, X_(1), has the exponential distribution E(a,θ/n) and that
2 Σ_{i=1}^n (X_i − X_(1))/θ has the chi-square distribution χ²_{2n−2}.

21. Let (X_1,Y_1),...,(X_n,Y_n) be i.i.d. random 2-vectors. Suppose that
X_1 has the Cauchy distribution C(0,1) and given X_1 = x, Y_1 has
the Cauchy distribution C(βx,1), where β ∈ R. Let X̄ and Ȳ be
the sample means based on X_i's and Y_i's, respectively. Obtain the
marginal distributions of Ȳ, Ȳ − βX̄, and Ȳ/X̄.


22. Let X_i = (Y_i, Z_i), i = 1,...,n, be i.i.d. random 2-vectors. The sample
correlation coefficient is defined to be

T(X) = [(n−1) √(S_Y² S_Z²)]^{−1} Σ_{i=1}^n (Y_i − Ȳ)(Z_i − Z̄),

where Ȳ = n^{−1} Σ_{i=1}^n Y_i, Z̄ = n^{−1} Σ_{i=1}^n Z_i,
S_Y² = (n−1)^{−1} Σ_{i=1}^n (Y_i − Ȳ)², and S_Z² = (n−1)^{−1} Σ_{i=1}^n (Z_i − Z̄)².
(a) Assume that E|Y_i|^4 < ∞ and E|Z_i|^4 < ∞. Show that

√n [T(X) − ρ] →_d N(0, c²),

where ρ is the correlation coefficient between Y_1 and Z_1 and c is a
constant depending on some unknown parameters.
(b) Assume that Y_i and Z_i are independently distributed as N(μ_1, σ_1²)
and N(μ_2, σ_2²), respectively. Show that T has the Lebesgue p.d.f.

f(t) = Γ((n−1)/2) / [√π Γ((n−2)/2)] (1 − t²)^{(n−4)/2} I_(−1,1)(t).

(c) Assume the conditions in (b). Obtain the result in (a) using
Scheffé's theorem (Proposition 1.18).

23. Let X_1,...,X_n be i.i.d. random variables with EX_1^4 < ∞, T = (Y,Z),
and T_1 = Y/√Z, where Y = n^{−1} Σ_{i=1}^n |X_i| and Z = n^{−1} Σ_{i=1}^n X_i².
(a) Show that √n(T − θ) →_d N_2(0,Σ) and √n(T_1 − ϑ) →_d N(0,c²).
Identify θ, Σ, ϑ, and c² in terms of moments of X_1.
(b) Repeat (a) when X_1 has the normal distribution N(0,σ²).
(c) Repeat (a) when X_1 has the double exponential distribution
DE(0,σ).

24. Prove the claims in Example 2.9 for the distributions related to order
statistics.

25. Show that if T is a sufficient statistic and T = ψ(S), where ψ is
measurable and S is another statistic, then S is sufficient.

26. In the proof of Lemma 2.1, show that C_0 ∈ C. Also, prove Lemma
2.1 when 𝒫 is dominated by a σ-finite measure.

27. Let X_1,...,X_n be i.i.d. random variables from P_θ ∈ {P_θ : θ ∈ Θ}. In
the following cases, find a sufficient statistic for θ ∈ Θ that has the
same dimension as θ.
(a) P_θ is the Poisson distribution P(θ), θ ∈ (0,∞).
(b) P_θ is the negative binomial distribution NB(θ,r) with a known
r, θ ∈ (0,1).
(c) P_θ is the exponential distribution E(0,θ), θ ∈ (0,∞).
(d) P_θ is the gamma distribution Γ(α,γ), θ = (α,γ) ∈ (0,∞) × (0,∞).
(e) P_θ is the beta distribution B(α,β), θ = (α,β) ∈ (0,1) × (0,1).
(f) P_θ is the log-normal distribution LN(μ,σ²), θ = (μ,σ²) ∈ R × (0,∞).
(g) P_θ is the Weibull distribution W(α,θ) with a known α > 0,
θ ∈ (0,∞).

28. Let X_1,...,X_n be i.i.d. random variables from P_(a,θ), where (a,θ) ∈ R²
is a parameter. Find a two-dimensional sufficient statistic for (a,θ)
in the following cases.
(a) P_(a,θ) is the exponential distribution E(a,θ), a ∈ R, θ ∈ (0,∞).
(b) P_(a,θ) is the Pareto distribution Pa(a,θ), a ∈ (0,∞), θ ∈ (0,∞).

29. In Example 2.11, show that X_(1) (or X_(n)) is sufficient for a (or b) if
we consider a subfamily {f_(a,b) : a < b} with a fixed b (or a).

30. Let X and Y be two random variables such that Y has the binomial
distribution Bi(π,N) and, given Y = y, X has the binomial distribution
Bi(p,y).
(a) Suppose that p ∈ (0,1) and π ∈ (0,1) are unknown and N is
known. Show that (X,Y) is minimal sufficient for (p,π).
(b) Suppose that π and N are known and p ∈ (0,1) is unknown. Show
whether X is sufficient for p and whether Y is sufficient for p.

31. Let X_1,...,X_n be i.i.d. random variables having a distribution P ∈ 𝒫,
where 𝒫 is the family of distributions on R having continuous
c.d.f.'s. Let T = (X_(1),...,X_(n)) be the vector of order statistics. Show
that, given T, the conditional distribution of X = (X_1,...,X_n) is a
discrete distribution putting probability 1/n! on each of the n! points
(X_(i_1),...,X_(i_n)) ∈ R^n, where {i_1,...,i_n} is a permutation of {1,...,n};
hence, T is sufficient for P ∈ 𝒫.

32. In Example 2.13 and Example 2.14, show that T is minimal sufficient
for θ by using Theorem 2.3(iii).

33. A coin has probability p of coming up heads and 1 − p of coming
up tails, where p ∈ (0,1). The first stage of an experiment consists
of tossing this coin a known total of M times and recording X, the
number of heads. In the second stage, the coin is tossed until a total
of X + 1 tails have come up. The number Y of heads observed in
the second stage along the way to getting the X + 1 tails is then
recorded. This experiment is repeated independently a total of n
times and the two counts (X_i, Y_i) for the ith experiment are recorded,
i = 1,...,n. Obtain a statistic that is minimal sufficient for p and
derive its distribution.


34. Let X_1,...,X_n be i.i.d. random variables having the Lebesgue p.d.f.

f_θ(x) = exp{ −((x − μ)/σ)^4 − ξ(θ) },

where θ = (μ,σ) ∈ Θ = R × (0,∞). Show that 𝒫 = {P_θ : θ ∈ Θ} is
an exponential family, where P_θ is the joint distribution of X_1,...,X_n,
and that the statistic T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i², Σ_{i=1}^n X_i³, Σ_{i=1}^n X_i^4)
is minimal sufficient for θ ∈ Θ.

35. Let X_1,...,X_n be i.i.d. random variables having the Lebesgue p.d.f.

f_θ(x) = (2θ)^{−1} [I_(0,θ)(x) + I_(2θ,3θ)(x)].

Find a minimal sufficient statistic for θ ∈ (0,∞).

36. Let X_1,...,X_n be i.i.d. random variables having the Cauchy distribution
C(μ,σ) with unknown μ ∈ R and σ > 0. Show that the vector
of order statistics is minimal sufficient for (μ,σ).

37. Let X_1,...,X_n be i.i.d. random variables having the double exponential
distribution DE(μ,θ) with unknown μ ∈ R and θ > 0. Show that
the vector of order statistics is minimal sufficient for (μ,θ).

38. Let X_1,...,X_n be i.i.d. random variables having the Weibull distribution
W(α,θ) with unknown α > 0 and θ > 0. Show that the vector
of order statistics is minimal sufficient for (α,θ).

39. Let X_1,...,X_n be i.i.d. random variables having the beta distribution
B(β,β) with an unknown β > 0. Find a minimal sufficient statistic
for β.

40. Let X_1,...,X_n be i.i.d. random variables having a population P in
a parametric family indexed by (θ,j), where θ > 0, j = 1,2, and
n ≥ 2. When j = 1, P is the N(0,θ²) distribution. When j = 2,
P is the double exponential distribution DE(0,θ). Show that T =
(Σ_{i=1}^n X_i², Σ_{i=1}^n |X_i|) is minimal sufficient for (θ,j).

41. Let X_1,...,X_n be i.i.d. random variables having a population P in a
parametric family indexed by (θ,j), where θ ∈ (0,1), j = 1,2, and
n ≥ 2. When j = 1, P is the Poisson distribution P(θ). When j = 2,
P is the binomial distribution Bi(θ,1).
(a) Show that T = Σ_{i=1}^n X_i is not sufficient for (θ,j).
(b) Find a two-dimensional minimal sufficient statistic for (θ,j).

42. Let X be a sample from P ∈ 𝒫 = {f_{θ,j} : θ ∈ Θ, j = 1,...,k}, where
f_{θ,j}'s are p.d.f.'s w.r.t. a common σ-finite measure and Θ is a set of
parameters. Assume that {x : f_{θ,j}(x) > 0} ⊂ {x : f_{θ,k}(x) > 0} for all
θ and j = 1,...,k − 1. Suppose that for each fixed j, T = T(X) is a
statistic sufficient for θ.
(a) Obtain a k-dimensional statistic that is sufficient for (θ,j).
(b) Derive a sufficient condition under which T is minimal sufficient
for (θ,j).

43. A box has an unknown odd number of balls labeled consecutively as
−θ, −(θ − 1),...,−2,−1,0,1,2,...,(θ − 1),θ, where θ is an unknown
nonnegative integer. A simple random sample X_1,...,X_n is taken
without replacement, where X_i is the label on the ith ball selected
and n < 2θ + 1.
(a) Find a statistic that is minimal sufficient for θ and derive its
distribution.
(b) Show that the minimal sufficient statistic in (a) is also complete.

44. Let X_1,...,X_n be i.i.d. random variables having the Lebesgue p.d.f.
θ^{−1} e^{−(x−θ)/θ} I_(θ,∞)(x), where θ > 0 is an unknown parameter.
(a) Find a statistic that is minimal sufficient for θ.
(b) Show whether the minimal sufficient statistic in (a) is complete.

45. Let X_1,...,X_n (n ≥ 2) be i.i.d. random variables having the normal
distribution N(θ,2) when θ = 0 and the normal distribution N(θ,1)
when θ ∈ R and θ ≠ 0. Show that the sample mean X̄ is a complete
statistic for θ but it is not a sufficient statistic for θ.

46. Let X be a random variable with a distribution P_θ in 𝒫 = {P_θ : θ ∈ Θ},
f_θ be the p.d.f. of P_θ w.r.t. a measure ν, A be an event, and 𝒫_A =
{f_θ I_A / P_θ(A) : θ ∈ Θ}.
(a) Show that if T(X) is sufficient for P_θ ∈ 𝒫, then it is sufficient for
P_θ ∈ 𝒫_A.
(b) Show that if T is sufficient and complete for P_θ ∈ 𝒫, then it is
complete for P_θ ∈ 𝒫_A.

47. Show that (X_(1), X_(n)) in Example 2.13 is not complete.

48. Let T be a complete (or boundedly complete) and sufficient statistic.
Suppose that there is a minimal sufficient statistic S. Show that T is
minimal sufficient and S is complete (or boundedly complete).

49. Let T and S be two statistics such that S = ψ(T) for a measurable
ψ. Show that
(a) if T is complete, then S is complete;
(b) if T is complete and sufficient and ψ is one-to-one, then S is
complete and sufficient;
(c) the results in (a) and (b) still hold if the completeness is replaced
by the bounded completeness.


50. Find complete and sufficient statistics for the families in Exercises 27
and 28.

51. Show that (X_(1), X_(n)) in Example 2.11 is complete.

52. Let (X_1,Y_1),...,(X_n,Y_n) be i.i.d. random 2-vectors having the following
Lebesgue p.d.f.

f_θ(x,y) = (2πγ²)^{−1} I_(0,γ)(√((x − a)² + (y − b)²)), (x,y) ∈ R²,

where θ = (a,b,γ) ∈ R² × (0,∞).
(a) If a = 0 and b = 0, find a complete and sufficient statistic for γ.
(b) If all parameters are unknown, show that the convex hull of the
sample points is a sufficient statistic for θ.

53. Let X be a discrete random variable with p.d.f.

f_θ(x) = θ                  if x = 0,
         (1 − θ)² θ^{x−1}    if x = 1,2,...,
         0                  otherwise,

where θ ∈ (0,1). Show that X is boundedly complete, but not complete.

54. Show that the sufficient statistic T in Example 2.10 is also complete
without using Proposition 2.1.

55. Let Y_1,...,Y_n be i.i.d. random variables having the Lebesgue p.d.f.
λ x^{λ−1} I_(0,1)(x) with an unknown λ > 0 and let Z_1,...,Z_n be i.i.d.
discrete random variables having the power series distribution given
in Exercise 13 with an unknown θ > 0. Assume that Y_i's and Z_i's
are independent. Let X_i = Y_i + Z_i, i = 1,...,n. Find a complete
and sufficient statistic for the unknown parameter (θ,λ) based on the
sample X = (X_1,...,X_n).

56. Suppose that (X_1,Y_1),...,(X_n,Y_n) are i.i.d. random 2-vectors and
X_i and Y_i are independently distributed as N(μ,σ_X²) and N(μ,σ_Y²),
respectively, with θ = (μ,σ_X²,σ_Y²) ∈ R × (0,∞) × (0,∞). Let X̄ and
S_X² be the sample mean and variance given by (2.1) and (2.2) for X_i's
and Ȳ and S_Y² be the sample mean and variance for Y_i's. Show that
T = (X̄, Ȳ, S_X², S_Y²) is minimal sufficient for θ but T is not boundedly
complete.

57. Let X_1,...,X_n be i.i.d. from the N(θ,θ²) distribution, where θ > 0
is a parameter. Find a minimal sufficient statistic for θ and show
whether it is complete.


58. Suppose that (X_1,Y_1),...,(X_n,Y_n) are i.i.d. random 2-vectors having
the normal distribution with EX_1 = EY_1 = 0, Var(X_1) = Var(Y_1) =
1, and Cov(X_1,Y_1) = θ ∈ (−1,1).
(a) Find a minimal sufficient statistic for θ.
(b) Show whether the minimal sufficient statistic in (a) is complete
or not.
(c) Prove that T_1 = Σ_{i=1}^n X_i² and T_2 = Σ_{i=1}^n Y_i² are both ancillary
but (T_1,T_2) is not ancillary.

59. Let X_1,...,X_n be i.i.d. random variables having the exponential
distribution E(a,θ).
(a) Show that Σ_{i=1}^n (X_i − X_(1)) and X_(1) are independent for any
(a,θ).
(b) Show that Z_i = (X_(n) − X_(i))/(X_(n) − X_(n−1)), i = 1,...,n − 2,
are independent of (X_(1), Σ_{i=1}^n (X_i − X_(1))).

60. Let X_1,...,X_n be i.i.d. random variables having the gamma distribution
Γ(α,γ). Show that Σ_{i=1}^n X_i and Σ_{i=1}^n [log X_i − log X_(1)] are
independent for any (α,γ).

61. Let X_1,...,X_n be i.i.d. random variables having the uniform distribution
on the interval (a,b), where −∞ < a < b < ∞. Show
that (X_(i) − X_(1))/(X_(n) − X_(1)), i = 2,...,n − 1, are independent
of (X_(1), X_(n)) for any a and b.

62. Consider Example 2.19. Assume that n > 2.
(a) Show that X̄ is better than T_1 if P = N(θ,σ²), θ ∈ R, σ > 0.
(b) Show that T_1 is better than X̄ if P is the uniform distribution on
the interval (θ − ½, θ + ½), θ ∈ R.
(c) Find a family 𝒫 for which neither X̄ nor T_1 is better than the
other.

63. Let X_1,...,X_n be i.i.d. from the N(μ,σ²) distribution, where μ ∈ R
and σ > 0. Consider the estimation of σ² with the squared error loss.
Show that ((n−1)/n) S² is better than S², the sample variance. Can you
find an estimator of the form cS² with a nonrandom c such that it is
better than ((n−1)/n) S²?

64. Let X_1,...,X_n be i.i.d. binary random variables with P(X_i = 1) = θ ∈
(0,1). Consider estimating θ with the squared error loss. Calculate
the risks of the following estimators:
(a) the nonrandomized estimators X̄ (the sample mean) and

T_0(X) = 0    if more than half of X_i's are 0,
         1    if more than half of X_i's are 1,
         ½    if exactly half of X_i's are 0;

(b) the randomized estimators

T_1(X) = X̄      with probability ½,
         T_0    with probability ½

and

T_2(X) = X̄    with probability X̄,
         ½    with probability 1 − X̄.

65. Let X_1,...,X_n be i.i.d. random variables having the exponential
distribution E(0,θ), θ ∈ (0,∞). Consider estimating θ with the squared
error loss. Calculate the risks of the sample mean X̄ and cX_(1), where
c is a positive constant. Is X̄ better than cX_(1) for some c?

66. Consider the estimation of an unknown parameter ϑ ≥ 0 under the
squared error loss. Show that if T and U are two estimators such that
T ≤ U and R_T(P) < R_U(P), then R_{T_+}(P) < R_{U_+}(P), where R_T(P)
is the risk of an estimator T and T_+ denotes the positive part of T.

67. Let X_1,...,X_n be i.i.d. random variables having the exponential
distribution E(0,θ), θ ∈ (0,∞). Consider the hypotheses

H_0 : θ ≤ θ_0 versus H_1 : θ > θ_0,

where θ_0 > 0 is a fixed constant. Obtain the risk function (in terms
of θ) of the test rule T_c(X) = I_(c,∞)(X̄), under the 0-1 loss.

68. Let X_1,...,X_n be i.i.d. random variables having the Cauchy distribution
C(μ,σ) with unknown μ ∈ R and σ > 0. Consider the hypotheses

H_0 : μ ≤ μ_0 versus H_1 : μ > μ_0,

where μ_0 is a fixed constant. Obtain the risk function of the test rule
T_c(X) = I_(c,∞)(X̄), under the 0-1 loss.

69. Let X_1,...,X_n be i.i.d. binary random variables with P(X_i = 1) = θ,
where θ ∈ (0,1) is unknown and n is an even integer. Consider the
problem of testing H_0 : θ ≤ 0.5 versus H_1 : θ > 0.5 with action space
{0,1} (0 means H_0 is accepted and 1 means H_1 is accepted). Let
the loss function be L(θ,a) = 0 if H_j is true and a = j, j = 0,1;
L(θ,0) = C_0 when θ > 0.5; and L(θ,1) = C_1 when θ ≤ 0.5, where
C_0 > C_1 > 0 are some constants. Calculate the risk function of the
following randomized test (decision rule):

T = 0    if more than half of X_i's are 0,
    1    if more than half of X_i's are 1,
    ½    if exactly half of X_i's are 0.


70. Consider Example 2.21. Suppose that our decision rule, based on
a sample X = (X_1,...,X_n) with i.i.d. components from the N(θ,1)
distribution with an unknown θ > 0, is

T(X) = a_1    if b_1 < X̄,
       a_2    if b_0 < X̄ ≤ b_1,
       a_3    if X̄ ≤ b_0.

Express the risk of T in terms of θ.

71. Consider an estimation problem with 𝒫 = {P_θ : θ ∈ Θ} (a parametric
family), A = Θ, and the squared error loss. If θ_0 ∈ Θ satisfies that
P_θ ≪ P_{θ_0} for any θ ∈ Θ, show that the estimator T ≡ θ_0 is admissible.

72. Let ℑ be a class of decision rules. A subclass ℑ_0 ⊂ ℑ is called
ℑ-complete if and only if, for any T ∈ ℑ and T ∉ ℑ_0, there is a T_0 ∈ ℑ_0
that is better than T, and ℑ_0 is called ℑ-minimal complete if and
only if ℑ_0 is ℑ-complete and no proper subclass of ℑ_0 is ℑ-complete.
Show that if a ℑ-minimal complete class exists, then it is exactly the
class of ℑ-admissible rules.

73. Let X_1,...,X_n be i.i.d. random variables having a distribution P ∈ 𝒫.
Assume that EX_1² < ∞. Consider estimating μ = EX_1 under the
squared error loss.
(a) Show that any estimator of the form aX̄ + b is inadmissible, where
X̄ is the sample mean, a and b are constants, and a > 1.
(b) Show that any estimator of the form X̄ + b is inadmissible, where
b ≠ 0 is a constant.

74. Consider an estimation problem with ϑ ∈ [c,d] ⊂ R, where c and d
are known. Suppose that the action space is A ⊃ [c,d] and the loss
function is L(|ϑ − a|), where L(·) is an increasing function on [0,∞).
Show that any decision rule T with P(T(X) ∉ [c,d]) > 0 for some
P ∈ 𝒫 is inadmissible.

75. Suppose that the action space is (Ω, 𝓑_Ω^k), where Ω ∈ 𝓑^k. Let X
be a sample from P ∈ 𝒫, δ_0(X) be a nonrandomized rule, and T
be a sufficient statistic for P ∈ 𝒫. Show that if E[I_A(δ_0(X))|T] is a
nonrandomized rule, i.e., E[I_A(δ_0(X))|T] = I_A(h(T)) for any A ∈ 𝓑_Ω^k,
where h is a Borel function, then δ_0(X) = h(T(X)) a.s. P.

76. Let T, δ_0, and δ_1 be as given in the statement of Proposition 2.2.
Show that

∫_A L(P,a) dδ_1(X,a) = E[ ∫_A L(P,a) dδ_0(X,a) | T ] a.s. P.


77. Prove Theorem 2.5.

78. In Exercise 64, use Theorem 2.5 to find decision rules that are better
than T_j, j = 0,1,2.

79. In Exercise 65, use Theorem 2.5 to find a decision rule better than
cX_(1).

80. Consider Example 2.22.
(a) Show that there is no optimal rule if ℑ contains all possible
estimators. (Hint: consider constant estimators.)
(b) Find a ℑ_2-optimal rule if X_1,...,X_n are independent random
variables having a common mean μ and Var(X_i) = σ²/a_i with known a_i,
i = 1,...,n.
(c) Find a ℑ_2-optimal rule if X_1,...,X_n are identically distributed but
are correlated with a common correlation coefficient ρ.

81. Let X_ij = μ + a_i + ε_ij, i = 1,...,m, j = 1,...,n, where a_i's and ε_ij's
are independent random variables, a_i is N(0,σ_a²), ε_ij is N(0,σ_e²), and
μ, σ_a², and σ_e² are unknown parameters. Define X̄_i = n^{−1} Σ_{j=1}^n X_ij,
X̄ = m^{−1} Σ_{i=1}^m X̄_i, MSA = n(m − 1)^{−1} Σ_{i=1}^m (X̄_i − X̄)², and MSE
= m^{−1}(n − 1)^{−1} Σ_{i=1}^m Σ_{j=1}^n (X_ij − X̄_i)². Assume that m(n − 1) > 4.
Consider the following class of estimators of θ = σ_a²/σ_e²:

{ θ̂(δ) = (1/n) [ (1 − δ) MSA/MSE − 1 ] : δ ∈ R }.

(a) Show that MSA and MSE are independent.
(b) Obtain a δ ∈ R such that θ̂(δ) is unbiased for θ.
(c) Show that the risk of θ̂(δ) under the squared error loss is a function
of (δ,θ).
(d) Show that there is a constant δ* such that for any fixed θ, the risk
of θ̂(δ) is strictly decreasing in δ for δ < δ* and strictly increasing for
δ > δ*.
(e) Show that the unbiased estimator of θ derived in (b) is inadmissible.

82. Let T_0(X) be an unbiased estimator of ϑ in an estimation problem.
Show that any unbiased estimator of ϑ is of the form T(X) = T_0(X) −
U(X), where U(X) is an "unbiased estimator" of 0.

83. Let X be a discrete random variable with

P(X = −1) = p, P(X = k) = (1 − p)² p^k, k = 0,1,2,...,

where p ∈ (0,1) is unknown.
(a) Show that U(X) is an unbiased estimator of 0 if and only if U(k) =
ak for all k = −1,0,1,2,... and some a.
(b) Show that T_0(X) = I_{0}(X) is unbiased for ϑ = (1 − p)² and that,
under the squared error loss, T_0 is a ℑ-optimal rule, where ℑ is the
class of all unbiased estimators of ϑ.
(c) Show that T_0(X) = I_{−1}(X) is unbiased for ϑ = p and that,
under the squared error loss, there is no ℑ-optimal rule, where ℑ is
the class of all unbiased estimators of ϑ.

84. (Nonexistence of an unbiased estimator). Let X be a random variable
having the binomial distribution Bi(p,n) with an unknown p ∈ (0,1)
and a known n. Consider the problem of estimating ϑ = p^{−1}. Show
that there is no unbiased estimator of ϑ.

85. Let X_1,...,X_n be i.i.d. random variables having the normal distribution
N(θ,1), where θ = 0 or 1. Consider the estimation of θ.
(a) Let ℑ be the class of nonrandomized rules (estimators), i.e.,
estimators that take values 0 and 1 only. Show that there does not exist
any unbiased estimator of θ in ℑ.
(b) Find an estimator in ℑ that is approximately unbiased.

86. Let X_1,...,X_n be i.i.d. from the Poisson distribution P(θ) with an
unknown θ > 0. Find the bias and mse of T_n = (1 − a/n)^{nX̄} as an
estimator of ϑ = e^{−aθ}, where a ≠ 0 is a known constant.

87. Let X_1,...,X_n be i.i.d. (n > 3) from N(μ,σ²), where μ > 0 and σ > 0
are unknown parameters. Let T_1 = X̄/S be an estimator of μ/σ and
T_2 = X̄² be an estimator of μ², where X̄ and S² are the sample mean
and variance, respectively. Calculate the mse's of T_1 and T_2.

88. Consider a location family {P_μ : μ ∈ R^k} on R^k, where P_μ = P_(μ,I_k)
is given in (2.10). Let l_0 ∈ R^k be a fixed vector and L(P,a) =
L(‖μ − a‖), where a ∈ A = R^k and L(·) is a nonnegative Borel
function on [0,∞). Show that the family is invariant and the decision
problem is invariant under the transformation g(X) = X + c l_0, c ∈ R.
Find an invariant decision rule.

89. Let X_1,...,X_n be i.i.d. from the N(μ,σ²) distribution with unknown
μ ∈ R and σ² > 0. Consider the scale transformation aX, a ∈ (0,∞).
(a) For estimating σ² under the loss function L(P,a) = (1 − a/σ²)²,
show that the problem is invariant and that the sample variance S²
is invariant.
(b) For testing H_0 : μ ≤ 0 versus H_1 : μ > 0 under the loss

L(P,0) = (μ/σ) I_(0,∞)(μ) and L(P,1) = (|μ|/σ) I_(−∞,0](μ),

show that the problem is invariant and any test that is a function of
X̄/√(S²/n) is invariant.


90. Let X_1,...,X_n be i.i.d. random variables having the c.d.f. F(x − θ),
where F is symmetric about 0 and θ ∈ R is unknown.
(a) Show that the c.d.f. of Σ_{i=1}^n w_i X_(i) − θ is symmetric about 0,
where X_(i) is the ith order statistic and w_i's are constants satisfying
w_i = w_{n−i+1} and Σ_{i=1}^n w_i = 1.
(b) Show that Σ_{i=1}^n w_i X_(i) in (a) is unbiased for θ if the mean of F
exists.
(c) Show that Σ_{i=1}^n w_i X_(i) is location invariant when Σ_{i=1}^n w_i = 1.

91. In Example 2.25, show that the conditional distribution of θ given
X = x is N(μ_*(x),c²) with μ_*(x) and c² given by (2.25).

92. A median of a random variable Y (or its distribution) is any value m
such that P(Y ≤ m) ≥ ½ and P(Y ≥ m) ≥ ½.
(a) Show that the set of medians is a closed interval [m_0,m_1].
(b) Suppose that E|Y| < ∞. If c is not a median of Y, show that
E|Y − c| ≥ E|Y − m| for any median m of Y.
(c) Let X be a sample from P_θ, where θ ∈ Θ ⊂ R. Consider the
estimation of θ under the absolute error loss function |a − θ|. Let Π
be a given distribution on Θ with finite mean. Find the ℑ-Bayes rule
w.r.t. Π, where ℑ is the class of all rules.

93. (Classification). Let X be a sample having a p.d.f. f_j(x) w.r.t. a
σ-finite measure ν, where j is unknown and j ∈ {1,...,J} with a known
integer J ≥ 2. Consider a decision problem in which the action space
A = {1,...,J} and the loss function is

L(j,a) = 0    if a = j,
         1    if a ≠ j.

(a) Let ℑ be the class of all nonrandomized decision rules. Obtain
the risk of a δ ∈ ℑ.
(b) Let Π be a probability measure on {1,...,J} with Π({j}) = π_j,
j = 1,...,J. Obtain the Bayes risk of δ ∈ ℑ w.r.t. Π.
(c) Obtain a ℑ-Bayes rule w.r.t. Π in (b).
(d) Assume that J = 2, π_1 = π_2 = 0.5, and f_j(x) = φ(x − μ_j), where
φ(x) is the p.d.f. of the standard normal distribution and μ_j, j = 1,2,
are known constants. Obtain the Bayes rule in (c) and compute the
Bayes risk.
(e) Obtain the risk and the Bayes risk (w.r.t. Π in (b)) of a randomized
decision rule.
(f) Obtain a Bayes rule w.r.t. Π.
(g) Obtain a minimax rule.

94. Let θ̂ be an unbiased estimator of an unknown θ ∈ R.
(a) Under the squared error loss, show that the estimator θ̂ + c is not
minimax unless sup_θ R_T(θ) = ∞ for any estimator T, where c ≠ 0 is
a known constant.
(b) Under the squared error loss, show that the estimator cθ̂ is not
minimax unless sup_θ R_T(θ) = ∞ for any estimator T, where c ∈ (0,1)
is a known constant.
(c) Consider the loss function L(θ,a) = (a − θ)²/θ² (assuming θ ≠ 0).
Show that θ̂ is not minimax unless sup_θ R_T(θ) = ∞ for any T.

95. Let X be a binary observation with P(X = 1) = θ_1 or θ_2, where
0 < θ_1 < θ_2 < 1 are known values. Consider the estimation of θ
with action space {a_1,a_2} and loss function L(θ_i,a_j) = l_ij, where
l_21 ≥ l_12 > l_11 = l_22 = 0. For a decision rule δ(X), the vector
(R_δ(θ_1), R_δ(θ_2)) is defined to be its risk point.
(a) Show that the set of risk points of all decision rules is the convex
hull of the set of risk points of all nonrandomized rules.
(b) Find a minimax rule.
(c) Let Π be a distribution on {θ_1,θ_2}. Obtain the class of all Bayes
rules w.r.t. Π. Discuss when there is a unique Bayes rule.

96. Consider the decision problem in Example 2.23.
(a) Let Π be the uniform distribution on (0,1). Show that a ℑ-Bayes
rule w.r.t. Π is T_{j*}(X), where j* is the largest integer in {0,1,...,n − 1}
such that B_{j+1,n−j+1}(θ_0) ≥ ½ and B_{a,b}(·) denotes the c.d.f. of the beta
distribution B(a,b).
(b) Derive a ℑ-minimax rule.

97. Let X_1,...,X_n be i.i.d. from the N(μ,σ²) distribution with unknown
μ ∈ R and σ² > 0. To test the hypotheses

H_0 : μ ≤ μ_0 versus H_1 : μ > μ_0,

where μ_0 is a fixed constant, consider a test of the form T_c(X) =
I_(c,∞)(T_{μ_0}), where T_{μ_0} = (X̄ − μ_0)/√(S²/n) and c is a fixed constant.
(a) Find the size of T_c. (Hint: T_{μ_0} has the t-distribution t_{n−1}.)
(b) If α is a given level of significance, find a c_α such that T_{c_α} has
size α.
(c) Compute the p-value for T_{c_α} derived in (b).
(d) Find a c_α such that [X̄ − c_α √(S²/n), X̄ + c_α √(S²/n)] is a confidence
interval for μ with confidence coefficient 1 − α. What is the expected
interval length?

98. In Exercise 67, calculate the size of T_c(X); find a c_α such that T_{c_α}
has size α, a given level of significance; and find the p-value for T_{c_α}.

99. In Exercise 68, assume that σ is known. Calculate the size of T_c(X);
find a c_α such that T_{c_α} has size α, a given level of significance; and
find the p-value for T_{c_α}.


100. Let α ∈ (0,1) be given and T_{j,q}(X) be the test given in Example 2.30.
Show that there exist integer j and q ∈ (0,1) such that the size of
T_{j,q} is α.

101. Let X_1,...,X_n be i.i.d. from the exponential distribution E(a,θ) with
unknown a ∈ R and θ > 0. Let α ∈ (0,1) be given.
(a) Using T_1(X) = Σ_{i=1}^n (X_i − X_(1)), construct a confidence interval
for θ with confidence coefficient 1 − α and find the expected interval
length.
(b) Using T_1(X) and T_2(X) = X_(1), construct a confidence interval
for a with confidence coefficient 1 − α and find the expected interval
length.
(c) Using the method in Example 2.32, construct a confidence set for
the two-dimensional parameter (a,θ) with confidence coefficient 1 − α.

102. Suppose that X is a sample and a statistic T(X) has a distribution
in a location family {P_μ : μ ∈ R}. Using T(X), derive a confidence
interval for μ with level of significance 1 − α and obtain the expected
interval length. Show that if the c.d.f. of T(X) is continuous, then we
can always find a confidence interval for μ with confidence coefficient
1 − α for any α ∈ (0,1).

103. Let X = (X_1,...,X_n) be a sample from P_θ, where θ ∈ {θ_1,...,θ_k}
with a fixed integer k. Let T_n(X) be an estimator of θ with range
{θ_1,...,θ_k}.
(a) Show that T_n(X) is consistent if and only if P_θ(T_n(X) = θ) → 1.
(b) Show that if T_n(X) is consistent, then it is a_n-consistent for any
{a_n}.

104. Let X_1,...,X_n be i.i.d. from the uniform distribution on (θ − ½, θ + ½),
where θ ∈ R is unknown. Show that (X_(1) + X_(n))/2 is strongly
consistent for θ and also consistent in mse.

105. Let X_1,...,X_n be i.i.d. from a population with the Lebesgue p.d.f.
f_θ(x) = 2^{−1}(1 + θx) I_(−1,1)(x), where θ ∈ (−1,1) is an unknown
parameter. Find a consistent estimator of θ. Is your estimator
√n-consistent?

106. Let X_1,...,X_n be i.i.d. observations. Suppose that T_n is an unbiased
estimator of ϑ based on X_1,...,X_n such that for any n, Var(T_n) < ∞
and Var(T_n) ≤ Var(U_n) for any other unbiased estimator U_n of ϑ
based on X_1,...,X_n. Show that T_n is consistent in mse.

107. Consider the Bayes rule μ_*(X) in Example 2.25. Show that μ_*(X) is
a strongly consistent, √n-consistent, and L_2-consistent estimator of
μ. What is the order of the bias of μ_*(X) as an estimator of μ?


108. In Exercise 21, show that
(a) $\bar{Y}/\bar{X}$ is an inconsistent estimator of $\beta$;
(b) $\hat\beta = Z_{(m)}$ is a consistent estimator of $\beta$, where $m = n/2$ when $n$
is even, $m = (n+1)/2$ when $n$ is odd, and $Z_{(i)}$ is the $i$th smallest
value of $Y_i/X_i$, $i = 1,\ldots,n$.

109. Show that the estimator $T_0$ of $\theta$ in Exercise 64 is inconsistent.

110. Let $g_1, g_2, \ldots$ be continuous functions on $(a,b) \subset \mathbb{R}$ such that $g_n(x) \to g(x)$
uniformly for $x$ in any closed subinterval of $(a,b)$. Let $T_n$ be a
consistent estimator of $\theta \in (a,b)$. Show that $g_n(T_n)$ is consistent for
$\vartheta = g(\theta)$.

111. Let $X_1,\ldots,X_n$ be i.i.d. from $P$ with unknown mean $\mu \in \mathbb{R}$ and
variance $\sigma^2 > 0$, and let $g(\mu) = 0$ if $\mu \ne 0$ and $g(0) = 1$. Find a consistent
estimator of $\vartheta = g(\mu)$.

112. Establish results for the smallest order statistic $X_{(1)}$ (based on i.i.d.
random variables $X_1,\ldots,X_n$) similar to those in Example 2.34.

113. (Consistency for finite population). In Example 2.27, show that $\hat{Y} \to_p Y$
as $n \to N$ for any fixed $N$ and population. Is $\hat{Y}$ still consistent if
sampling is with replacement?

114. Assume that $X_i = \theta t_i + e_i$, $i = 1,\ldots,n$, where $\theta \in \Theta$ is an unknown
parameter, $\Theta$ is a closed subset of $\mathbb{R}$, the $e_i$'s are i.i.d. on the interval
$[-\tau,\tau]$ with some unknown $\tau > 0$ and $Ee_i = 0$, and the $t_i$'s are fixed
constants. Let
\[ T_n = S_n(\tilde\theta_n) = \min_{\gamma \in \Theta} S_n(\gamma), \]
where
\[ S_n(\gamma) = 2 \max_{i \le n} |X_i - \gamma t_i| / \sqrt{1+\gamma^2}. \]
(a) Assume that $\sup_i |t_i| < \infty$ and $\sup_i t_i - \inf_i t_i > 2\tau$. Show that
the sequence $\{\tilde\theta_n, n = 1,2,\ldots\}$ is bounded a.s.
(b) Let $\theta_n \in \Theta$, $n = 1,2,\ldots$. If $\theta_n \to \theta$, show that
\[ S_n(\theta_n) - S_n(\theta) = O(|\theta_n - \theta|) \quad \text{a.s.} \]
(c) Under the conditions in (a), show that $T_n$ is a strongly consistent
estimator of $\vartheta = \min_{\gamma \in \Theta} S(\gamma)$, where $S(\gamma) = \lim_{n\to\infty} S_n(\gamma)$ a.s.

115. Let $X_1,\ldots,X_n$ be i.i.d. random variables with $EX_1^2 < \infty$ and $\bar{X}$ be
the sample mean. Consider the estimation of $\mu = EX_1$.
(a) Let $T_n = \bar{X} + \xi_n/\sqrt{n}$, where $\xi_n$ is a random variable satisfying
$\xi_n = 0$ with probability $1 - n^{-1}$ and $\xi_n = n^{3/2}$ with probability $n^{-1}$.


Show that $b_{T_n}(P) \ne \tilde{b}_{T_n}(P)$ for any $P$.
(b) Let $T_n = \bar{X} + \eta_n/\sqrt{n}$, where $\eta_n$ is a random variable that is
independent of $X_1,\ldots,X_n$ and equals 0 with probability $1 - 2n^{-1}$ and
$\pm\sqrt{n}$ with probability $n^{-1}$. Show that $\mathrm{amse}_{T_n}(P) = \mathrm{amse}_{\bar{X}}(P) =
\mathrm{mse}_{\bar{X}}(P)$ and $\mathrm{mse}_{T_n}(P) > \mathrm{amse}_{T_n}(P)$ for any $P$.

116. Let $X_1,\ldots,X_n$ be i.i.d. random variables with finite $\theta = EX_1$ and
$\mathrm{Var}(X_1) = \theta$, where $\theta > 0$ is unknown. Consider the estimation of
$\vartheta = \sqrt{\theta}$. Let $T_{1n} = \sqrt{\bar{X}}$ and $T_{2n} = \bar{X}/S$, where $\bar{X}$ and $S^2$ are the
sample mean and sample variance.
(a) Obtain the $n^{-1}$ order asymptotic biases of $T_{1n}$ and $T_{2n}$ according
to (2.38).
(b) Obtain the asymptotic relative efficiency of $T_{1n}$ w.r.t. $T_{2n}$.

117. Let $X_1,\ldots,X_n$ be i.i.d. according to $N(\mu,1)$ with an unknown $\mu \in \mathbb{R}$.
Let $\vartheta = P(X_1 \le c)$ for a fixed constant $c$. Consider the following
estimators of $\vartheta$: $T_{1n} = F_n(c)$, where $F_n$ is the empirical c.d.f. defined
in (2.28), and $T_{2n} = \Phi(c - \bar{X})$, where $\Phi$ is the c.d.f. of $N(0,1)$.
(a) Find the $n^{-1}$ order asymptotic bias of $T_{2n}$ according to (2.38).
(b) Find the asymptotic relative efficiency of $T_{1n}$ w.r.t. $T_{2n}$.

118. Let $X_1,\ldots,X_n$ be i.i.d. from the $N(0,\sigma^2)$ distribution with an
unknown $\sigma > 0$. Consider the estimation of $\vartheta = \sigma$. Find the asymptotic
relative efficiency of $\sqrt{\pi/2}\,\sum_{i=1}^n |X_i|/n$ w.r.t. $\left(\sum_{i=1}^n X_i^2/n\right)^{1/2}$.

119. Let $X_1,\ldots,X_n$ be i.i.d. from $P$ with $EX_1^4 < \infty$ and unknown mean
$\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$. Consider the estimation of $\vartheta = \mu^2$ and
the following three estimators: $T_{1n} = \bar{X}^2$, $T_{2n} = \bar{X}^2 - S^2/n$, $T_{3n} =
\max\{0, T_{2n}\}$, where $\bar{X}$ and $S^2$ are the sample mean and variance.
Show that the amse's of $T_{jn}$, $j = 1,2,3$, are the same when $\mu \ne 0$ but
may be different when $\mu = 0$. Which estimator is the best in terms
of the asymptotic relative efficiency when $\mu = 0$?

120. Prove Theorem 2.6.

121. Let $X_1,\ldots,X_n$ be i.i.d. with $EX_i = \mu$, $\mathrm{Var}(X_i) = 1$, and $EX_i^4 < \infty$.
Let $T_{1n} = n^{-1}\sum_{i=1}^n X_i^2 - 1$ and $T_{2n} = \bar{X}^2 - n^{-1}$ be estimators of
$\vartheta = \mu^2$.
(a) Find the asymptotic relative efficiency of $T_{1n}$ w.r.t. $T_{2n}$.
(b) Show that $e_{T_{1n},T_{2n}}(P) \le 1$ if the c.d.f. of $X_i - \mu$ is symmetric
about 0 and $\mu \ne 0$.
(c) Find a distribution $P$ for which $e_{T_{1n},T_{2n}}(P) > 1$.

122. Let $X_1,\ldots,X_n$ be i.i.d. binary random variables with unknown $p =
P(X_i = 1) \in (0,1)$. Consider the estimation of $p$. Let $a$ and $b$ be
two positive constants. Find the asymptotic relative efficiency of the
estimator $(a + n\bar{X})/(a + b + n)$ w.r.t. $\bar{X}$.


123. Let $X_1,\ldots,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with an unknown $\mu \in \mathbb{R}$ and a
known $\sigma^2$. Let $T_1 = \bar{X}$ be the sample mean and $T_2 = \mu_*(X)$ be the
Bayes estimator given in (2.25). Assume that $EX_1^4 < \infty$.
(a) Calculate the exact mse of both estimators. Can you conclude
that one estimator is better than the other in terms of the mse?
(b) Find the asymptotic relative efficiency of $T_1$ w.r.t. $T_2$.

124. In Example 2.37, show that
(a) the limiting size of $T_{c_\alpha}$ is 1 if $\mathcal{P}$ contains all possible populations
on $\mathbb{R}$ with finite second moments;
(b) $T_n = T_{c_\alpha}$ with $\alpha = \alpha_n$ (given by (2.40)) is Chernoff-consistent;
(c) $T_n$ in (b) is not strongly Chernoff-consistent if $\mathcal{P}$ contains all
possible populations on $\mathbb{R}$ with finite second moments.

125. Let $X_1,\ldots,X_n$ be i.i.d. with unknown mean $\mu \in \mathbb{R}$ and variance
$\sigma^2 > 0$. For testing $H_0 : \mu \le \mu_0$ versus $H_1 : \mu > \mu_0$, consider
the test $T_{c_\alpha}$ obtained in Exercise 97(b).
(a) Show that $T_{c_\alpha}$ has asymptotic significance level $\alpha$ and is
consistent.
(b) Find a test that is Chernoff-consistent.

126. Consider the test $T_j$ in Example 2.23. For each $n$, find a $j = j_n$ such
that $T_{j_n}$ has asymptotic significance level $\alpha \in (0,1)$.

127. Show that the test $T_{c_\alpha}$ in Exercise 98 is consistent, but $T_{c_\alpha}$ in Exercise
99 is not consistent.

128. In Example 2.31, suppose that we drop the normality assumption but
assume that $\mu = EX_i$ and $\sigma^2 = \mathrm{Var}(X_i)$ are finite.
(a) Show that when $\sigma^2$ is known, the asymptotic significance level
of the confidence interval $[\bar{X} - c_\alpha, \bar{X} + c_\alpha]$ is $1-\alpha$, where $c_\alpha =
\sigma z_{1-\alpha/2}/\sqrt{n}$ and $z_a = \Phi^{-1}(a)$.
(b) Show that when $\sigma^2$ is known, the limiting confidence coefficient
of the interval in (a) might be 0 if $\mathcal{P}$ contains all possible populations
on $\mathbb{R}$.
(c) Show that the confidence interval in Exercise 97(d) has asymptotic
significance level $1-\alpha$.

129. Let $X_1,\ldots,X_n$ be i.i.d. with unknown mean $\mu \in \mathbb{R}$ and variance $\sigma^2 >
0$. Assume that $EX_1^4 < \infty$. Using the sample variance $S^2$, construct a
confidence interval for $\sigma^2$ that has asymptotic significance level $1-\alpha$.

130. Consider the sample correlation coefficient $T$ defined in Exercise 22.
Construct a confidence interval for $\rho$ that has asymptotic significance
level $1-\alpha$, assuming that $(Y_i, Z_i)$ is normally distributed. (Hint:
show that the asymptotic variance of $T$ is $(1-\rho^2)^2$.)


Chapter 3 


Unbiased Estimation 


Unbiased or asymptotically unbiased estimation plays an important role 
in point estimation theory. Unbiasedness of point estimators is defined in 
§2.3.2. In this chapter, we discuss in detail how to derive unbiased esti- 
mators and, more importantly, how to find the best unbiased estimators in 
various situations. Although an unbiased estimator (even the best unbiased 
estimator if it exists) is not necessarily better than a slightly biased esti- 
mator in terms of their mse’s (see Exercise 63 in §2.6), unbiased estimators 
can be used as “building blocks” for the construction of better estimators. 
Furthermore, one may give up the exact unbiasedness, but cannot give up 
asymptotic unbiasedness since it is necessary for consistency (see §2.5.2). 
Properties and the construction of asymptotically unbiased estimators are 
studied in the last part of this chapter. 


3.1 The UMVUE 


Let $X$ be a sample from an unknown population $P \in \mathcal{P}$ and $\vartheta$ be a
real-valued parameter related to $P$. Recall that an estimator $T(X)$ of $\vartheta$ is
unbiased if and only if $E[T(X)] = \vartheta$ for any $P \in \mathcal{P}$. If there exists an
unbiased estimator of $\vartheta$, then $\vartheta$ is called an estimable parameter.


Definition 3.1. An unbiased estimator $T(X)$ of $\vartheta$ is called the
uniformly minimum variance unbiased estimator (UMVUE) if and only if
$\mathrm{Var}(T(X)) \le \mathrm{Var}(U(X))$ for any $P \in \mathcal{P}$ and any other unbiased estimator
$U(X)$ of $\vartheta$.


Since the mse of any unbiased estimator is its variance, a UMVUE is
S-optimal in mse with S being the class of all unbiased estimators. One
can similarly define the uniformly minimum risk unbiased estimator in
statistical decision theory when we use an arbitrary loss instead of the squared
error loss that corresponds to the mse.


3.1.1 Sufficient and complete statistics 


The derivation of a UMVUE is relatively simple if there exists a sufficient
and complete statistic for $P \in \mathcal{P}$.


Theorem 3.1 (Lehmann-Scheffé theorem). Suppose that there exists a
sufficient and complete statistic $T(X)$ for $P \in \mathcal{P}$. If $\vartheta$ is estimable, then
there is a unique unbiased estimator of $\vartheta$ that is of the form $h(T)$ with a
Borel function $h$. (Two estimators that are equal a.s. $\mathcal{P}$ are treated as one
estimator.) Furthermore, $h(T)$ is the unique UMVUE of $\vartheta$.


This theorem is a consequence of Theorem 2.5(ii) (Rao-Blackwell the- 
orem). One can easily extend this theorem to the case of the uniformly 
minimum risk unbiased estimator under any loss function L(P,a) that is 
strictly convex in a. The uniqueness of the UMVUE follows from the com- 
pleteness of T(X). 

There are two typical ways to derive a UMVUE when a sufficient and 
complete statistic T is available. The first one is solving for h when the 
distribution of T is available. The following are two typical examples. 


Example 3.1. Let $X_1,\ldots,X_n$ be i.i.d. from the uniform distribution on
$(0,\theta)$, $\theta > 0$. Let $\vartheta = g(\theta)$, where $g$ is a differentiable function on $(0,\infty)$.
Since the sufficient and complete statistic $X_{(n)}$ has the Lebesgue p.d.f.
$n\theta^{-n}x^{n-1}I_{(0,\theta)}(x)$, an unbiased estimator $h(X_{(n)})$ of $\vartheta$ must satisfy
\[ \theta^n g(\theta) = n\int_0^\theta h(x)x^{n-1}\,dx \quad \text{for all } \theta > 0. \]
Differentiating both sides of the previous equation and applying the result
of differentiation of an integral (Royden (1968, §5.3)) lead to
\[ n\theta^{n-1}g(\theta) + \theta^n g'(\theta) = nh(\theta)\theta^{n-1}. \]
Hence, the UMVUE of $\vartheta$ is $h(X_{(n)}) = g(X_{(n)}) + n^{-1}X_{(n)}g'(X_{(n)})$. In
particular, if $\vartheta = \theta$, then the UMVUE of $\theta$ is $(1+n^{-1})X_{(n)}$.
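The special case of Example 3.1 with the parameter of interest equal to $\theta$ can be checked by simulation. The following is a minimal sketch, not part of the original text: the function names, the parameter values ($\theta = 3$, $n = 5$) and the competing estimator $2\bar{X}$ are illustrative assumptions. It estimates the mean and variance of $(1+n^{-1})X_{(n)}$ and of $2\bar{X}$, both of which are unbiased for $\theta$.

```python
import random

def umvue_uniform_theta(sample):
    """(1 + 1/n) * X_(n), the UMVUE of theta for an i.i.d. Uniform(0, theta) sample."""
    n = len(sample)
    return (1 + 1 / n) * max(sample)

def compare_estimators(theta=3.0, n=5, reps=50_000, seed=0):
    """Monte Carlo (mean, variance) of the UMVUE and of 2 * sample mean."""
    rng = random.Random(seed)
    umvue_vals, mom_vals = [], []
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        umvue_vals.append(umvue_uniform_theta(xs))
        mom_vals.append(2 * sum(xs) / n)  # unbiased, but not a function of X_(n)

    def mean_var(vals):
        m = sum(vals) / len(vals)
        return m, sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

    return mean_var(umvue_vals), mean_var(mom_vals)

if __name__ == "__main__":
    (m1, v1), (m2, v2) = compare_estimators()
    print(f"(1+1/n)X_(n): mean={m1:.4f}, var={v1:.4f}")
    print(f"2*Xbar:       mean={m2:.4f}, var={v2:.4f}")
```

Both sample means should be close to $\theta$, while the variance of the UMVUE should be visibly smaller, in line with the exact values $\theta^2/[n(n+2)]$ and $\theta^2/(3n)$.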


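Example 3.2. Let $X_1,\ldots,X_n$ be i.i.d. from the Poisson distribution $P(\theta)$
with an unknown $\theta > 0$. Then $T(X) = \sum_{i=1}^n X_i$ is sufficient and complete
for $\theta > 0$ and has the Poisson distribution $P(n\theta)$. Suppose that $\vartheta = g(\theta)$,
where $g$ is a smooth function such that $g(x) = \sum_{j=0}^\infty a_j x^j$, $x > 0$. An
unbiased estimator $h(T)$ of $\vartheta$ must satisfy
\[
\sum_{t=0}^\infty \frac{h(t)n^t}{t!}\theta^t = e^{n\theta}g(\theta)
= \sum_{k=0}^\infty \frac{n^k}{k!}\theta^k \sum_{j=0}^\infty a_j\theta^j
= \sum_{t=0}^\infty \Bigg(\sum_{j,k:\,j+k=t} \frac{n^k a_j}{k!}\Bigg)\theta^t
\]
for any $\theta > 0$. Thus, a comparison of coefficients in front of $\theta^t$ leads to
\[ h(t) = \frac{t!}{n^t}\sum_{j,k:\,j+k=t} \frac{n^k a_j}{k!}, \]
i.e., $h(T)$ is the UMVUE of $\vartheta$. In particular, if $\vartheta = \theta^r$ for some fixed integer
$r \ge 1$, then $a_r = 1$ and $a_k = 0$ if $k \ne r$ and
\[
h(t) = \begin{cases} 0 & t < r \\[2pt] \dfrac{t!}{n^r (t-r)!} & t \ge r. \end{cases}
\]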
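The unbiasedness of the estimator of $\theta^r$ in Example 3.2 can be verified numerically by summing $h(t)$ against the Poisson($n\theta$) probabilities. The sketch below is illustrative only: the function names, the truncation point of the series and the parameter values are assumptions, not part of the original text.

```python
import math

def h(t, n, r):
    """h(t) = t! / (n**r * (t - r)!) for t >= r and 0 otherwise."""
    if t < r:
        return 0.0
    return math.perm(t, r) / n**r  # math.perm(t, r) = t! / (t - r)!

def expected_h(theta, n, r, terms=200):
    """E[h(T)] for T ~ Poisson(n * theta), with the series truncated at `terms`."""
    lam = n * theta
    total = 0.0
    pmf = math.exp(-lam)           # P(T = 0)
    for t in range(terms):
        total += h(t, n, r) * pmf
        pmf *= lam / (t + 1)       # P(T = t + 1) from P(T = t)
    return total

if __name__ == "__main__":
    theta, n, r = 1.5, 4, 3
    print(expected_h(theta, n, r), theta**r)
```

For moderate $n\theta$ the truncated sum agrees with $\theta^r$ up to floating-point error, which is the coefficient-matching argument of the example in numerical form.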


The second method of deriving a UMVUE when there is a sufficient and
complete statistic $T(X)$ is conditioning on $T$, i.e., if $U(X)$ is any unbiased
estimator of $\vartheta$, then $E[U(X)|T]$ is the UMVUE of $\vartheta$. To apply this method,
we do not need the distribution of $T$, but need to work out the conditional
expectation $E[U(X)|T]$. From the uniqueness of the UMVUE, it does not
matter which $U(X)$ is used and, thus, we should choose $U(X)$ so as to make
the calculation of $E[U(X)|T]$ as easy as possible.


Example 3.3. Consider the estimation problem in Example 2.26, where
$\vartheta = 1 - F_\theta(t)$ and $F_\theta(x) = (1 - e^{-x/\theta})I_{(0,\infty)}(x)$. Since $\bar{X}$ is sufficient and
complete for $\theta > 0$ and $I_{(t,\infty)}(X_1)$ is unbiased for $\vartheta$,
\[ T(X) = E[I_{(t,\infty)}(X_1)|\bar{X}] = P(X_1 > t|\bar{X}) \]
is the UMVUE of $\vartheta$. If the conditional distribution of $X_1$ given $\bar{X}$ is
available, then we can calculate $P(X_1 > t|\bar{X})$ directly. But the following
technique can be applied to avoid the derivation of conditional distributions.
By Basu's theorem (Theorem 2.4), $X_1/\bar{X}$ and $\bar{X}$ are independent. By
Proposition 1.10(vii),
\[ P(X_1 > t|\bar{X} = \bar{x}) = P(X_1/\bar{X} > t/\bar{X}\,|\,\bar{X} = \bar{x}) = P(X_1/\bar{X} > t/\bar{x}). \]
To compute this unconditional probability, we need the distribution of
\[ X_1\Big/\sum_{i=1}^n X_i = X_1\Big/\Big(X_1 + \sum_{i=2}^n X_i\Big). \]
Using the transformation technique discussed in §1.3.1 and the fact that
$\sum_{i=2}^n X_i$ is independent of $X_1$ and has a gamma distribution, we obtain
that $X_1/\sum_{i=1}^n X_i$ has the Lebesgue p.d.f. $(n-1)(1-x)^{n-2}I_{(0,1)}(x)$. Hence
\[ P(X_1 > t|\bar{X} = \bar{x}) = (n-1)\int_{t/(n\bar{x})}^1 (1-x)^{n-2}\,dx = \Big(1 - \frac{t}{n\bar{x}}\Big)^{n-1} \]
and the UMVUE of $\vartheta$ is
\[ T(X) = \Big(1 - \frac{t}{n\bar{X}}\Big)^{n-1}. \]
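A simulation can illustrate that the conditioning argument of Example 3.3 yields an unbiased estimator of $e^{-t/\theta}$. The sketch below rests on stated assumptions: the function names and the values $\theta = 2$, $t = 1$, $n = 5$ are illustrative, and the estimator is set to 0 when $t \ge n\bar{X}$, since $X_1 \le \sum_i X_i$ forces the conditional probability to vanish there.

```python
import math
import random

def umvue_survival(sample, t):
    """(1 - t/(n*Xbar))**(n-1), taken to be 0 when t >= n*Xbar."""
    n = len(sample)
    total = sum(sample)             # equals n * Xbar
    if t >= total:                  # X_1 cannot exceed the sum of all observations
        return 0.0
    return (1.0 - t / total) ** (n - 1)

def monte_carlo_mean(theta=2.0, t=1.0, n=5, reps=100_000, seed=1):
    """Average of the estimator over simulated exponential samples with mean theta."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        xs = [rng.expovariate(1.0 / theta) for _ in range(n)]
        acc += umvue_survival(xs, t)
    return acc / reps

if __name__ == "__main__":
    print(monte_carlo_mean(), math.exp(-1.0 / 2.0))
```

The two printed numbers should agree up to Monte Carlo error, even for this small $n$, because the estimator is exactly unbiased rather than only asymptotically so.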


We now show more examples of applying these two methods to find 
UMVUE’s. 


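Example 3.4. Let $X_1,\ldots,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with unknown $\mu \in \mathbb{R}$
and $\sigma^2 > 0$. From Example 2.18, $T = (\bar{X},S^2)$ is sufficient and
complete for $\theta = (\mu,\sigma^2)$, and $\bar{X}$ and $(n-1)S^2/\sigma^2$ are independent and have
the $N(\mu,\sigma^2/n)$ and chi-square distribution $\chi^2_{n-1}$, respectively. Using the
method of solving for $h$ directly, we find that the UMVUE for $\mu$ is $\bar{X}$; the
UMVUE of $\mu^2$ is $\bar{X}^2 - S^2/n$; the UMVUE for $\sigma^r$ with $r > 1-n$ is $k_{n-1,r}S^r$,
where
\[ k_{n,r} = \frac{n^{r/2}\,\Gamma(n/2)}{2^{r/2}\,\Gamma\big(\frac{n+r}{2}\big)} \]
(exercise); and the UMVUE of $\mu/\sigma$ is $k_{n-1,-1}\bar{X}/S$, if $n > 2$.

Suppose that $\vartheta$ satisfies $P(X_1 \le \vartheta) = p$ with a fixed $p \in (0,1)$. Let $\Phi$
be the c.d.f. of the standard normal distribution. Then $\vartheta = \mu + \sigma\Phi^{-1}(p)$
and its UMVUE is $\bar{X} + k_{n-1,1}S\Phi^{-1}(p)$.

Let $c$ be a fixed constant and $\vartheta = P(X_1 \le c) = \Phi\big(\frac{c-\mu}{\sigma}\big)$. We can
find the UMVUE of $\vartheta$ using the method of conditioning and the technique
used in Example 3.3. Since $I_{(-\infty,c)}(X_1)$ is an unbiased estimator of $\vartheta$, the
UMVUE of $\vartheta$ is $E[I_{(-\infty,c)}(X_1)|T] = P(X_1 \le c|T)$. By Basu's theorem,
the ancillary statistic $Z(X) = (X_1 - \bar{X})/S$ is independent of $T = (\bar{X},S^2)$.
Then, by Proposition 1.10(vii),
\[
P\big(X_1 \le c\,|\,T = (\bar{x},s^2)\big) = P\Big(Z \le \frac{c-\bar{X}}{S}\,\Big|\,T = (\bar{x},s^2)\Big) = P\Big(Z \le \frac{c-\bar{x}}{s}\Big).
\]
It can be shown that $Z$ has the Lebesgue p.d.f.
\[
f(z) = \frac{\sqrt{n}\,\Gamma\big(\frac{n-1}{2}\big)}{\sqrt{\pi}\,(n-1)\,\Gamma\big(\frac{n-2}{2}\big)}
\Big[1 - \frac{nz^2}{(n-1)^2}\Big]^{(n/2)-2} I_{(0,(n-1)/\sqrt{n})}(|z|) \tag{3.1}
\]
(exercise). Hence the UMVUE of $\vartheta$ is
\[ P(X_1 \le c|T) = \int_{-(n-1)/\sqrt{n}}^{(c-\bar{X})/S} f(z)\,dz \tag{3.2} \]
with $f$ given by (3.1).

Suppose that we would like to estimate $\vartheta = \frac{1}{\sigma}\Phi'\big(\frac{c-\mu}{\sigma}\big)$, the Lebesgue
p.d.f. of $X_1$ evaluated at a fixed $c$, where $\Phi'$ is the first-order derivative
of $\Phi$. By (3.2), the conditional p.d.f. of $X_1$ given $\bar{X} = \bar{x}$ and $S^2 = s^2$ is
$s^{-1}f\big(\frac{x-\bar{x}}{s}\big)$. Let $f_T$ be the joint p.d.f. of $T = (\bar{X},S^2)$. Then
\[
\vartheta = \int\!\!\int \frac{1}{s}\,f\Big(\frac{c-\bar{x}}{s}\Big) f_T(t)\,dt = E\Big[\frac{1}{S}\,f\Big(\frac{c-\bar{X}}{S}\Big)\Big].
\]
Hence the UMVUE of $\vartheta$ is
\[ \frac{1}{S}\,f\Big(\frac{c-\bar{X}}{S}\Big). \]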
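The bias-correcting constant for powers of $S$ in Example 3.4 is easy to evaluate numerically. The sketch below assumes the form $k_{n,r} = n^{r/2}\Gamma(n/2)/[2^{r/2}\Gamma((n+r)/2)]$, which follows from the moments of the chi-square distribution; the function names are hypothetical. Two sanity checks are built in: $k_{n-1,2} = 1$ (because $S^2$ is already unbiased for $\sigma^2$) and $k_{n-1,1} > 1$ (because $S$ underestimates $\sigma$ on average).

```python
import math

def k(n, r):
    """k_{n,r} = n^(r/2) * Gamma(n/2) / (2^(r/2) * Gamma((n + r)/2)); needs n + r > 0."""
    if n + r <= 0:
        raise ValueError("k_{n,r} requires n + r > 0")
    return n ** (r / 2) * math.gamma(n / 2) / (2 ** (r / 2) * math.gamma((n + r) / 2))

def umvue_sigma_power(s, n, r):
    """UMVUE of sigma**r from the sample standard deviation s of n normal observations."""
    return k(n - 1, r) * s**r

if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, k(n - 1, 2), k(n - 1, 1))
```

As $n$ grows, $k_{n-1,1}$ approaches 1, so the correction to $S$ matters mainly in small samples.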


Example 3.5. Let $X_1,\ldots,X_n$ be i.i.d. from a power series distribution (see
Exercise 13 in §2.6), i.e.,
\[ P(X_i = x) = \gamma(x)\theta^x/c(\theta), \quad x = 0,1,2,\ldots, \]
with a known function $\gamma(x) \ge 0$ and an unknown parameter $\theta > 0$. It turns
out that the joint distribution of $X = (X_1,\ldots,X_n)$ is in an exponential
family with a sufficient and complete statistic $T(X) = \sum_{i=1}^n X_i$. Furthermore,
the distribution of $T$ is also in a power series family, i.e.,
\[ P(T = t) = \gamma_n(t)\theta^t/[c(\theta)]^n, \quad t = 0,1,2,\ldots, \]
where $\gamma_n(t)$ is the coefficient of $\theta^t$ in the power series expansion of $[c(\theta)]^n$
(Exercise 13 in §2.6). This result can help us to find the UMVUE of $\vartheta =
g(\theta)$. For example, by comparing both sides of
\[ \sum_{t=0}^\infty h(t)\gamma_n(t)\theta^t = [c(\theta)]^{n-p}\theta^r, \]
we conclude that the UMVUE of $\theta^r/[c(\theta)]^p$ is
\[
h(T) = \begin{cases} 0 & T < r \\[2pt] \dfrac{\gamma_{n-p}(T-r)}{\gamma_n(T)} & T \ge r, \end{cases}
\]
where $r$ and $p$ are nonnegative integers. In particular, the case of $p = 1$
produces the UMVUE $\gamma(r)h(T)$ of the probability $P(X_1 = r) = \gamma(r)\theta^r/c(\theta)$
for any nonnegative integer $r$.


Example 3.6. Let $X_1,\ldots,X_n$ be i.i.d. from an unknown population $P$ in a
nonparametric family $\mathcal{P}$. We have discussed in §2.2 that in many cases the
vector of order statistics, $T = (X_{(1)},\ldots,X_{(n)})$, is sufficient and complete for
$P \in \mathcal{P}$. Note that an estimator $\varphi(X_1,\ldots,X_n)$ is a function of $T$ if and only if
the function $\varphi$ is symmetric in its $n$ arguments. Hence, if $T$ is sufficient and
complete, then a symmetric unbiased estimator of any estimable $\vartheta$ is the
UMVUE. For example, $\bar{X}$ is the UMVUE of $\vartheta = EX_1$; $S^2$ is the UMVUE
of $\mathrm{Var}(X_1)$; $n^{-1}\sum_{i=1}^n X_i^2 - S^2$ is the UMVUE of $(EX_1)^2$; and $F_n(t)$ is the
UMVUE of $P(X_1 \le t)$ for any fixed $t$.

Note that these conclusions are not true if $T$ is not sufficient and
complete for $P \in \mathcal{P}$. For example, if $\mathcal{P}$ contains all symmetric distributions
having Lebesgue p.d.f.'s and finite means, then there is no UMVUE for
$\vartheta = EX_1$ (exercise).


More discussions of UMVUE's in nonparametric problems are provided
in §3.2.


3.1.2 A necessary and sufficient condition 


When a complete and sufficient statistic is not available, it is usually very 
difficult to derive a UMVUE. In some cases, the following result can be 
applied, if we have enough knowledge about unbiased estimators of 0. 


Theorem 3.2. Let $\mathcal{U}$ be the set of all unbiased estimators of 0 with finite
variances and $T$ be an unbiased estimator of $\vartheta$ with $E(T^2) < \infty$.

(i) A necessary and sufficient condition for $T(X)$ to be a UMVUE of $\vartheta$ is
that $E[T(X)U(X)] = 0$ for any $U \in \mathcal{U}$ and any $P \in \mathcal{P}$.

(ii) Suppose that $T = h(\tilde{T})$, where $\tilde{T}$ is a sufficient statistic for $P \in \mathcal{P}$ and $h$
is a Borel function. Let $\mathcal{U}_{\tilde{T}}$ be the subset of $\mathcal{U}$ consisting of Borel functions
of $\tilde{T}$. Then a necessary and sufficient condition for $T$ to be a UMVUE of $\vartheta$
is that $E[T(X)U(X)] = 0$ for any $U \in \mathcal{U}_{\tilde{T}}$ and any $P \in \mathcal{P}$.

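Proof. (i) Suppose that $T$ is a UMVUE of $\vartheta$. Then $T_c = T + cU$, where
$U \in \mathcal{U}$ and $c$ is a fixed constant, is also unbiased for $\vartheta$ and, thus,
\[ \mathrm{Var}(T_c) \ge \mathrm{Var}(T), \quad c \in \mathbb{R},\ P \in \mathcal{P}, \]
which is the same as
\[ c^2\mathrm{Var}(U) + 2c\,\mathrm{Cov}(T,U) \ge 0, \quad c \in \mathbb{R},\ P \in \mathcal{P}. \]
This is impossible unless $\mathrm{Cov}(T,U) = E(TU) = 0$ for any $P \in \mathcal{P}$.

Suppose now $E(TU) = 0$ for any $U \in \mathcal{U}$ and $P \in \mathcal{P}$. Let $T_0$ be another
unbiased estimator of $\vartheta$ with $\mathrm{Var}(T_0) < \infty$. Then $T - T_0 \in \mathcal{U}$ and, hence,
\[ E[T(T - T_0)] = 0, \quad P \in \mathcal{P}, \]
which with the fact that $ET = ET_0$ implies that
\[ \mathrm{Var}(T) = \mathrm{Cov}(T,T_0), \quad P \in \mathcal{P}. \]
By inequality (1.37), $[\mathrm{Cov}(T,T_0)]^2 \le \mathrm{Var}(T)\mathrm{Var}(T_0)$. Hence $\mathrm{Var}(T) \le
\mathrm{Var}(T_0)$ for any $P \in \mathcal{P}$.
(ii) It suffices to show that $E(TU) = 0$ for any $U \in \mathcal{U}_{\tilde{T}}$ and $P \in \mathcal{P}$ implies
that $E(TU) = 0$ for any $U \in \mathcal{U}$ and $P \in \mathcal{P}$. Let $U \in \mathcal{U}$. Then $E(U|\tilde{T}) \in \mathcal{U}_{\tilde{T}}$
and the result follows from the fact that $T = h(\tilde{T})$ and
\[ E(TU) = E[E(TU|\tilde{T})] = E[E(h(\tilde{T})U|\tilde{T})] = E[h(\tilde{T})E(U|\tilde{T})]. \]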


Theorem 3.2 can be used to find a UMVUE, to check whether a partic- 
ular estimator is a UMVUE, and to show the nonexistence of any UMVUE. 
If there is a sufficient statistic, then by Rao-Blackwell’s theorem, we only 
need to focus on functions of the sufficient statistic and, hence, Theorem 
3.2(ii) is more convenient to use. 


Example 3.7. Let $X_1,\ldots,X_n$ be i.i.d. from the uniform distribution on
the interval $(0,\theta)$. In Example 3.1, $(1+n^{-1})X_{(n)}$ is shown to be the
UMVUE for $\theta$ when the parameter space is $\Theta = (0,\infty)$. Suppose now that
$\Theta = [1,\infty)$. Then $X_{(n)}$ is not complete, although it is still sufficient for $\theta$.
Thus, Theorem 3.1 does not apply. We now illustrate how to use Theorem
3.2(ii) to find a UMVUE of $\theta$. Let $U(X_{(n)})$ be an unbiased estimator of 0.
Since $X_{(n)}$ has the Lebesgue p.d.f. $n\theta^{-n}x^{n-1}I_{(0,\theta)}(x)$,
\[ 0 = \int_0^1 U(x)x^{n-1}\,dx + \int_1^\theta U(x)x^{n-1}\,dx \]
for all $\theta \ge 1$. This implies that $U(x) = 0$ a.e. Lebesgue measure on $[1,\infty)$
and
\[ \int_0^1 U(x)x^{n-1}\,dx = 0. \]
Consider $T = h(X_{(n)})$. To have $E(TU) = 0$, we must have
\[ \int_0^1 h(x)U(x)x^{n-1}\,dx = 0. \]
Thus, we may consider the following function:
\[ h(x) = \begin{cases} c & 0 \le x \le 1 \\ bx & x > 1, \end{cases} \]
where $c$ and $b$ are some constants. From the previous discussion,
\[ E[h(X_{(n)})U(X_{(n)})] = 0, \quad \theta \ge 1. \]
Since $E[h(X_{(n)})] = \theta$, we obtain that
\[
\theta = cP(X_{(n)} \le 1) + bE[X_{(n)}I_{(1,\infty)}(X_{(n)})]
= c\theta^{-n} + [bn/(n+1)](\theta - \theta^{-n}).
\]
Thus, $c = 1$ and $b = (n+1)/n$. The UMVUE of $\theta$ is then
\[ T = \begin{cases} 1 & 0 \le X_{(n)} \le 1 \\ (1+n^{-1})X_{(n)} & X_{(n)} > 1. \end{cases} \]
This estimator is better than $(1+n^{-1})X_{(n)}$, which is the UMVUE when
$\Theta = (0,\infty)$ and does not make use of the information about $\theta \ge 1$.
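The gain from using the restriction $\theta \ge 1$ in Example 3.7 can be illustrated by simulation. The following sketch is an illustration under assumed values ($\theta = 1.2$, $n = 3$) with hypothetical function names; it compares the restricted-space UMVUE with $(1+n^{-1})X_{(n)}$.

```python
import random

def umvue_restricted(sample):
    """UMVUE of theta when theta >= 1: 1 if X_(n) <= 1, else (1 + 1/n) * X_(n)."""
    n, x_max = len(sample), max(sample)
    return 1.0 if x_max <= 1.0 else (1 + 1 / n) * x_max

def umvue_unrestricted(sample):
    """(1 + 1/n) * X_(n), the UMVUE when the parameter space is (0, infinity)."""
    return (1 + 1 / len(sample)) * max(sample)

def simulate(theta=1.2, n=3, reps=100_000, seed=3):
    """Monte Carlo (mean, variance) of both estimators on the same samples."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        a.append(umvue_restricted(xs))
        b.append(umvue_unrestricted(xs))

    def mean_var(vals):
        m = sum(vals) / len(vals)
        return m, sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

    return mean_var(a), mean_var(b)

if __name__ == "__main__":
    (m_r, v_r), (m_u, v_u) = simulate()
    print(f"restricted:   mean={m_r:.4f}, var={v_r:.4f}")
    print(f"unrestricted: mean={m_u:.4f}, var={v_u:.4f}")
```

Both estimators are unbiased, so the comparison reduces to the variances; the improvement is largest when $\theta$ is close to 1, where $X_{(n)} \le 1$ occurs with high probability.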


Example 3.8. Let $X$ be a sample (of size 1) from the uniform distribution
$U(\theta - \frac12, \theta + \frac12)$, $\theta \in \mathbb{R}$. We now apply Theorem 3.2 to show that there
is no UMVUE of $\vartheta = g(\theta)$ for any nonconstant function $g$. Note that an
unbiased estimator $U(X)$ of 0 must satisfy
\[ \int_{\theta-\frac12}^{\theta+\frac12} U(x)\,dx = 0 \quad \text{for all } \theta \in \mathbb{R}. \]
Differentiating both sides of the previous equation and applying the result
of differentiation of an integral lead to $U(x) = U(x+1)$ a.e. $m$, where $m$ is
the Lebesgue measure on $\mathbb{R}$. If $T$ is a UMVUE of $g(\theta)$, then $T(X)U(X)$ is
unbiased for 0 and, hence, $T(x)U(x) = T(x+1)U(x+1)$ a.e. $m$, where $U(X)$
is any unbiased estimator of 0. Since this is true for all $U$, $T(x) = T(x+1)$
a.e. $m$. Since $T$ is unbiased for $g(\theta)$,
\[ g(\theta) = \int_{\theta-\frac12}^{\theta+\frac12} T(x)\,dx \quad \text{for all } \theta \in \mathbb{R}. \]
Differentiating both sides of the previous equation and applying the result
of differentiation of an integral, we obtain that
\[ g'(\theta) = T(\theta + \tfrac12) - T(\theta - \tfrac12) = 0 \quad \text{a.e. } m. \]


As a consequence of Theorem 3.2, we have the following useful result. 


Corollary 3.1. (i) Let $T_j$ be a UMVUE of $\vartheta_j$, $j = 1,\ldots,k$, where $k$ is a
fixed positive integer. Then $\sum_{j=1}^k c_jT_j$ is a UMVUE of $\vartheta = \sum_{j=1}^k c_j\vartheta_j$ for
any constants $c_1,\ldots,c_k$.

(ii) Let $T_1$ and $T_2$ be two UMVUE's of $\vartheta$. Then $T_1 = T_2$ a.s. $P$ for any
$P \in \mathcal{P}$.




3.1.3 Information inequality 


Suppose that we have a lower bound for the variances of all unbiased
estimators of $\vartheta$ and that there is an unbiased estimator $T$ of $\vartheta$ whose variance
is always the same as the lower bound. Then $T$ is a UMVUE of $\vartheta$.
Although this is not an effective way to find UMVUE's (compared with the
methods introduced in §3.1.1 and §3.1.2), it provides a way of assessing
the performance of UMVUE's. The following result provides such a lower
bound in some cases.


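Theorem 3.3 (Cramér-Rao lower bound). Let $X = (X_1,\ldots,X_n)$ be a sample
from $P \in \mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta$ is an open set in $\mathbb{R}^k$. Suppose
that $T(X)$ is an estimator with $E[T(X)] = g(\theta)$ being a differentiable
function of $\theta$; $P_\theta$ has a p.d.f. $f_\theta$ w.r.t. a measure $\nu$ for all $\theta \in \Theta$; and $f_\theta$ is
differentiable as a function of $\theta$ and satisfies
\[
\frac{\partial}{\partial\theta}\int h(x)f_\theta(x)\,d\nu = \int h(x)\frac{\partial}{\partial\theta}f_\theta(x)\,d\nu, \quad \theta \in \Theta, \tag{3.3}
\]
for $h(x) \equiv 1$ and $h(x) = T(x)$. Then
\[
\mathrm{Var}(T(X)) \ge \Big[\frac{\partial}{\partial\theta}g(\theta)\Big]^{\tau}[I(\theta)]^{-1}\frac{\partial}{\partial\theta}g(\theta), \tag{3.4}
\]
where
\[
I(\theta) = E\Big\{\frac{\partial}{\partial\theta}\log f_\theta(X)\Big[\frac{\partial}{\partial\theta}\log f_\theta(X)\Big]^{\tau}\Big\} \tag{3.5}
\]
is assumed to be positive definite for any $\theta \in \Theta$.
Proof. We prove the univariate case ($k = 1$) only. The proof for the
multivariate case ($k > 1$) is left to the reader. When $k = 1$, (3.4) reduces
to
\[
\mathrm{Var}(T(X)) \ge \frac{[g'(\theta)]^2}{E\big[\frac{\partial}{\partial\theta}\log f_\theta(X)\big]^2}. \tag{3.6}
\]
From inequality (1.37), we only need to show that
\[
E\Big[\frac{\partial}{\partial\theta}\log f_\theta(X)\Big]^2 = \mathrm{Var}\Big(\frac{\partial}{\partial\theta}\log f_\theta(X)\Big)
\]
and
\[
g'(\theta) = \mathrm{Cov}\Big(T(X), \frac{\partial}{\partial\theta}\log f_\theta(X)\Big).
\]
These two results are consequences of condition (3.3).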


The $k \times k$ matrix $I(\theta)$ in (3.5) is called the Fisher information matrix.
The greater $I(\theta)$ is, the easier it is to distinguish $\theta$ from neighboring values
and, therefore, the more accurately $\theta$ can be estimated. In fact, if the
equality in (3.6) holds for an unbiased estimator $T(X)$ of $g(\theta)$ (which is
then a UMVUE), then the greater $I(\theta)$ is, the smaller $\mathrm{Var}(T(X))$ is. Thus,
$I(\theta)$ is a measure of the information that $X$ contains about the unknown
$\theta$. The inequalities in (3.4) and (3.6) are called information inequalities.


The following result is helpful in finding the Fisher information matrix. 


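Proposition 3.1. (i) Let $X$ and $Y$ be independent with the Fisher
information matrices $I_X(\theta)$ and $I_Y(\theta)$, respectively. Then, the Fisher information
about $\theta$ contained in $(X,Y)$ is $I_X(\theta) + I_Y(\theta)$. In particular, if $X_1,\ldots,X_n$
are i.i.d. and $I_1(\theta)$ is the Fisher information about $\theta$ contained in a single
$X_i$, then the Fisher information about $\theta$ contained in $X_1,\ldots,X_n$ is $nI_1(\theta)$.
(ii) Suppose that $X$ has the p.d.f. $f_\theta$ that is twice differentiable in $\theta$ and
that (3.3) holds with $h(x) \equiv 1$ and $f_\theta$ replaced by $\partial f_\theta/\partial\theta$. Then
\[
I(\theta) = -E\Big[\frac{\partial^2}{\partial\theta\,\partial\theta^{\tau}}\log f_\theta(X)\Big]. \tag{3.7}
\]
Proof. Result (i) follows from the independence of $X$ and $Y$ and the
definition of the Fisher information. Result (ii) follows from the equality
\[
\frac{\partial^2}{\partial\theta\,\partial\theta^{\tau}}\log f_\theta(X)
= \frac{\frac{\partial^2}{\partial\theta\,\partial\theta^{\tau}}f_\theta(X)}{f_\theta(X)}
- \frac{\partial}{\partial\theta}\log f_\theta(X)\Big[\frac{\partial}{\partial\theta}\log f_\theta(X)\Big]^{\tau}.
\]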

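The following example provides a formula for the Fisher information
matrix for many parametric families with a two-dimensional parameter $\theta$.


Example 3.9. Let $X_1,\ldots,X_n$ be i.i.d. with the Lebesgue p.d.f. $\frac{1}{\sigma}f\big(\frac{x-\mu}{\sigma}\big)$,
where $f(x) > 0$ and $f'(x)$ exists for all $x \in \mathbb{R}$, $\mu \in \mathbb{R}$, and $\sigma > 0$ (a
location-scale family). Let $\theta = (\mu,\sigma)$. Then, the Fisher information about
$\theta$ contained in $X_1,\ldots,X_n$ is (exercise)
\[
I(\theta) = \frac{n}{\sigma^2}
\begin{pmatrix}
\displaystyle\int \frac{[f'(x)]^2}{f(x)}\,dx & \displaystyle\int \frac{f'(x)[xf'(x)+f(x)]}{f(x)}\,dx \\[10pt]
\displaystyle\int \frac{f'(x)[xf'(x)+f(x)]}{f(x)}\,dx & \displaystyle\int \frac{[xf'(x)+f(x)]^2}{f(x)}\,dx
\end{pmatrix}.
\]
Note that $I(\theta)$ depends on the particular parameterization. If $\theta = \psi(\eta)$
and $\psi$ is differentiable, then the Fisher information that $X$ contains about
$\eta$ is
\[
\frac{\partial}{\partial\eta}\psi(\eta)\,I(\psi(\eta))\Big[\frac{\partial}{\partial\eta}\psi(\eta)\Big]^{\tau}.
\]
However, it is easy to see that the Cramér-Rao lower bound in (3.4) or (3.6)
is not affected by any one-to-one reparameterization.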


If we use inequality (3.4) or (3.6) to find a UMVUE $T(X)$, then we
obtain a formula for $\mathrm{Var}(T(X))$ at the same time. On the other hand, the
Cramér-Rao lower bound in (3.4) or (3.6) is typically not sharp. Under
some regularity conditions, the Cramér-Rao lower bound is attained if and
only if $f_\theta$ is in an exponential family; see Propositions 3.2 and 3.3 and
the discussion in Lehmann (1983, p. 123). Some improved information
inequalities are available (see, e.g., Lehmann (1983, Sections 2.6 and 2.7)).


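Proposition 3.2. Suppose that the distribution of $X$ is from an
exponential family $\{f_\theta : \theta \in \Theta\}$, i.e., the p.d.f. of $X$ w.r.t. a $\sigma$-finite measure
is
\[ f_\theta(x) = \exp\big\{[\eta(\theta)]^{\tau}T(x) - \xi(\theta)\big\}c(x) \tag{3.8} \]
(see §2.1.3), where $\Theta$ is an open subset of $\mathbb{R}^k$.

(i) The regularity condition (3.3) is satisfied for any $h$ with $E|h(X)| < \infty$
and (3.7) holds.

(ii) If $I(\eta)$ is the Fisher information matrix for the natural parameter $\eta$,
then the variance-covariance matrix $\mathrm{Var}(T) = I(\eta)$.

(iii) If $I(\vartheta)$ is the Fisher information matrix for the parameter $\vartheta = E[T(X)]$,
then $\mathrm{Var}(T) = [I(\vartheta)]^{-1}$.

Proof. (i) This is a direct consequence of Theorem 2.1.

(ii) From (2.6), the p.d.f. under the natural parameter $\eta$ is
\[ f_\eta(x) = \exp\{\eta^{\tau}T(x) - \zeta(\eta)\}c(x). \]
From Theorem 2.1 and result (1.54) in §1.3.3, $E[T(X)] = \frac{\partial}{\partial\eta}\zeta(\eta)$. The
result follows from
\[ \frac{\partial}{\partial\eta}\log f_\eta(x) = T(x) - \frac{\partial}{\partial\eta}\zeta(\eta). \]
(iii) Since $\vartheta = E[T(X)] = \frac{\partial}{\partial\eta}\zeta(\eta)$, the reparameterization formula in
Example 3.9 gives
\[
I(\eta) = \frac{\partial\vartheta}{\partial\eta}\,I(\vartheta)\Big(\frac{\partial\vartheta}{\partial\eta}\Big)^{\tau}
= \frac{\partial^2}{\partial\eta\,\partial\eta^{\tau}}\zeta(\eta)\,I(\vartheta)\Big[\frac{\partial^2}{\partial\eta\,\partial\eta^{\tau}}\zeta(\eta)\Big]^{\tau}.
\]
By Theorem 2.1, result (1.54), and the result in (ii), $\frac{\partial^2}{\partial\eta\,\partial\eta^{\tau}}\zeta(\eta) = \mathrm{Var}(T) =
I(\eta)$. Hence
\[ I(\vartheta) = [I(\eta)]^{-1}I(\eta)[I(\eta)]^{-1} = [I(\eta)]^{-1} = [\mathrm{Var}(T)]^{-1}. \]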


A direct consequence of Proposition 3.2(ii) is that the variance of any 
linear function of T in (3.8) attains the Cramér-Rao lower bound. The 
following result gives a necessary condition for Var(U(X)) of an estimator 
U(X) to attain the Cramér-Rao lower bound. 




Proposition 3.3. Assume that the conditions in Theorem 3.3 hold with
$T(X)$ replaced by $U(X)$ and that $\Theta \subset \mathbb{R}$.
(i) If $\mathrm{Var}(U(X))$ attains the Cramér-Rao lower bound in (3.6), then
\[ a(\theta)[U(X) - g(\theta)] = g'(\theta)\frac{\partial}{\partial\theta}\log f_\theta(X) \quad \text{a.s. } P_\theta \]
for some function $a(\theta)$, $\theta \in \Theta$.
(ii) Let $f_\theta$ and $T$ be given by (3.8). If $\mathrm{Var}(U(X))$ attains the Cramér-Rao
lower bound, then $U(X)$ is a linear function of $T(X)$ a.s. $P_\theta$, $\theta \in \Theta$.


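Example 3.10. Let $X_1,\ldots,X_n$ be i.i.d. from the $N(\mu,\sigma^2)$ distribution with
an unknown $\mu \in \mathbb{R}$ and a known $\sigma^2$. Let $f_\mu$ be the joint p.d.f. of
$X = (X_1,\ldots,X_n)$. Then
\[ \frac{\partial}{\partial\mu}\log f_\mu(X) = \sum_{i=1}^n (X_i - \mu)/\sigma^2. \]
Thus, $I(\mu) = n/\sigma^2$. It is obvious that $\mathrm{Var}(\bar{X})$ attains the Cramér-Rao lower
bound in (3.6). Consider now the estimation of $\vartheta = \mu^2$. Since $E\bar{X}^2 =
\mu^2 + \sigma^2/n$, the UMVUE of $\vartheta$ is $h(\bar{X}) = \bar{X}^2 - \sigma^2/n$. A straightforward
calculation shows that
\[ \mathrm{Var}(h(\bar{X})) = \frac{4\mu^2\sigma^2}{n} + \frac{2\sigma^4}{n^2}. \]
On the other hand, the Cramér-Rao lower bound in this case is $4\mu^2\sigma^2/n$.
Hence $\mathrm{Var}(h(\bar{X}))$ does not attain the Cramér-Rao lower bound. The
difference is $2\sigma^4/n^2$.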
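The gap between the variance of the UMVUE of $\mu^2$ and the Cramér-Rao lower bound in Example 3.10 can be seen in a small simulation. The sketch below is illustrative: the function names and the values $\mu = 1$, $\sigma = 2$, $n = 10$ are assumptions, and the sample mean is drawn directly from $N(\mu,\sigma^2/n)$ rather than by averaging $n$ observations.

```python
import random

def exact_variance(mu, sigma, n):
    """Var(Xbar^2 - sigma^2/n) = 4 mu^2 sigma^2 / n + 2 sigma^4 / n^2."""
    return 4 * mu**2 * sigma**2 / n + 2 * sigma**4 / n**2

def cramer_rao_bound(mu, sigma, n):
    """[g'(mu)]^2 / I(mu) with g(mu) = mu^2 and I(mu) = n / sigma^2."""
    return 4 * mu**2 * sigma**2 / n

def simulate_umvue(mu=1.0, sigma=2.0, n=10, reps=200_000, seed=2):
    """Monte Carlo (mean, variance) of Xbar^2 - sigma^2/n."""
    rng = random.Random(seed)
    sd_mean = sigma / n**0.5
    vals = []
    for _ in range(reps):
        xbar = rng.gauss(mu, sd_mean)       # Xbar ~ N(mu, sigma^2 / n)
        vals.append(xbar**2 - sigma**2 / n)
    m = sum(vals) / reps
    return m, sum((v - m) ** 2 for v in vals) / (reps - 1)

if __name__ == "__main__":
    m, v = simulate_umvue()
    print(f"simulated mean={m:.4f}, variance={v:.4f}")
    print("exact variance:", exact_variance(1.0, 2.0, 10))
    print("Cramer-Rao bound:", cramer_rao_bound(1.0, 2.0, 10))
```

The excess over the bound is of order $n^{-2}$, so it becomes negligible relative to the bound as $n$ grows whenever $\mu \ne 0$.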


Condition (3.3) is a key regularity condition for the results in Theorem
3.3 and Proposition 3.3. If $f_\theta$ is not in an exponential family, then (3.3) has
to be checked. Typically, it does not hold if the set $\{x : f_\theta(x) > 0\}$ depends
on $\theta$ (Exercise 37). More discussions can be found in Pitman (1979).


3.1.4 Asymptotic properties of UMVUE’s 


UMVUE's are typically consistent (see Exercise 106 in §2.6). If there is
an unbiased estimator of $\vartheta$ whose mse is of the order $a_n^{-2}$, where $\{a_n\}$ is
a sequence of positive numbers diverging to $\infty$, then the UMVUE of $\vartheta$ (if
it exists) has an mse of order $a_n^{-2}$ and is $a_n$-consistent. For instance, in
Example 3.3, the mse of $U(X) = 1 - F_n(t)$ is $F_\theta(t)[1 - F_\theta(t)]/n$; hence the
UMVUE $T(X)$ is $\sqrt{n}$-consistent and its mse is of the order $n^{-1}$.


UMVUE's are exactly unbiased so that there is no need to discuss their
asymptotic biases. Their variances (or mse's) are finite, but amse's can be
used to assess their performance if the exact forms of mse's are difficult
to obtain. In many cases, although the variance of a UMVUE $T_n$ does
not attain the Cramér-Rao lower bound, the limit of the ratio of the amse
(or mse) of $T_n$ over the Cramér-Rao lower bound (if it is not 0) is 1. For
instance, in Example 3.10,
\[
\frac{\mathrm{Var}(\bar{X}^2 - \sigma^2/n)}{\text{the Cramér-Rao lower bound}} = 1 + \frac{\sigma^2}{2\mu^2 n} \to 1
\]
if $\mu \ne 0$. In general, under the conditions in Theorem 3.3, if $T_n(X)$ is
unbiased for $g(\theta)$ and if, for any $\theta \in \Theta$,
\[
T_n(X) - g(\theta) = \Big[\frac{\partial}{\partial\theta}g(\theta)\Big]^{\tau}[I(\theta)]^{-1}\frac{\partial}{\partial\theta}\log f_\theta(X)\,[1 + o_p(1)] \quad \text{a.s. } P_\theta, \tag{3.9}
\]
then
\[ \mathrm{amse}_{T_n}(\theta) = \text{the Cramér-Rao lower bound} \tag{3.10} \]
whenever the Cramér-Rao lower bound is not 0. Note that the case of zero
Cramér-Rao lower bound is not of interest since a zero lower bound does
not provide any information on the performance of estimators.


Consider the UMVUE $T_n = \big(1 - \frac{t}{n\bar{X}}\big)^{n-1}$ of $e^{-t/\theta}$ in Example 3.3.
Using the fact that
\[ \log(1-x) = -\sum_{j=1}^\infty \frac{x^j}{j}, \quad |x| < 1, \]
we obtain that
\[ T_n - e^{-t/\bar{X}} = O_p(n^{-1}). \]
Using Taylor's expansion, we obtain that
\[ e^{-t/\bar{X}} - e^{-t/\theta} = g'(\theta)(\bar{X} - \theta)[1 + o_p(1)], \]
where $g(\theta) = e^{-t/\theta}$. On the other hand,
\[ [I(\theta)]^{-1}\frac{\partial}{\partial\theta}\log f_\theta(X) = \bar{X} - \theta. \]
Hence (3.9) and (3.10) hold. Note that the exact variance of $T_n$ is not
easy to obtain. In this example, it can be shown that $\{n[T_n - g(\theta)]^2\}$ is
uniformly integrable and, therefore,
\[
\lim_{n\to\infty} n\mathrm{Var}(T_n) = \lim_{n\to\infty} n[\mathrm{amse}_{T_n}(\theta)]
= \lim_{n\to\infty} n[g'(\theta)]^2\Big(\frac{\theta^2}{n}\Big)
= \frac{t^2e^{-2t/\theta}}{\theta^2}.
\]
It is shown in Chapter 4 that if (3.10) holds, then $T_n$ is asymptotically
optimal in some sense. Hence UMVUE's satisfying (3.9), which is often
true, are asymptotically optimal, although they may be improved in terms
of the exact mse's.


3.2 U-Statistics 


Let $X_1,\ldots,X_n$ be i.i.d. from an unknown population $P$ in a nonparametric
family $\mathcal{P}$. In Example 3.6 we argued that if the vector of order statistics is
sufficient and complete for $P \in \mathcal{P}$, then a symmetric unbiased estimator
of any estimable $\vartheta$ is the UMVUE of $\vartheta$. In a large class of problems,
parameters to be estimated are of the form
\[ \vartheta = E[h(X_1,\ldots,X_m)] \]
with a positive integer $m$ and a Borel function $h$ that is symmetric and
satisfies $E|h(X_1,\ldots,X_m)| < \infty$ for any $P \in \mathcal{P}$. It is easy to see that a
symmetric unbiased estimator of $\vartheta$ is
\[ U_n = \binom{n}{m}^{-1}\sum_c h(X_{i_1},\ldots,X_{i_m}), \tag{3.11} \]
where $\sum_c$ denotes the summation over the $\binom{n}{m}$ combinations of $m$ distinct
elements $\{i_1,\ldots,i_m\}$ from $\{1,\ldots,n\}$.


Definition 3.2. The statistic $U_n$ in (3.11) is called a U-statistic with kernel
$h$ of order $m$.


3.2.1 Some examples 


The use of U-statistics is an effective way of obtaining unbiased estimators. 
In nonparametric problems, U-statistics are often UMVUE?’s, whereas in 
parametric problems, U-statistics can be used as initial estimators to derive 
more efficient estimators. 

If m = 1, U, in (3.11) is simply a type of sample mean. Examples 
include the empirical c.d.f. (2.28) evaluated at a particular t and the sample 
moments n~! )*"_, X for a positive integer k. We now consider some 
examples with m > 1. 

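Consider the estimation of $\vartheta = \mu^m$, where $\mu = EX_1$ and $m$ is a positive
integer. Using $h(x_1, \dots, x_m) = x_1 \cdots x_m$, we obtain the following U-statistic
unbiased for $\vartheta = \mu^m$:

$$U_n = \binom{n}{m}^{-1} \sum_c X_{i_1} \cdots X_{i_m}. \tag{3.12}$$

Consider next the estimation of $\vartheta = \sigma^2 = \mathrm{Var}(X_1)$. Since

$$\sigma^2 = [\mathrm{Var}(X_1) + \mathrm{Var}(X_2)]/2 = E[(X_1 - X_2)^2/2],$$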




we obtain the following U-statistic with kernel $h(x_1, x_2) = (x_1 - x_2)^2/2$:

$$U_n = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \frac{(X_i - X_j)^2}{2} = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n\bar{X}^2 \right) = S^2,$$

which is the sample variance in (2.2).


In some cases, we would like to estimate $\vartheta = E|X_1 - X_2|$, a measure of
concentration. Using kernel $h(x_1, x_2) = |x_1 - x_2|$, we obtain the following
U-statistic unbiased for $\vartheta = E|X_1 - X_2|$:

$$U_n = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} |X_i - X_j|,$$

which is known as Gini's mean difference.

Let $\vartheta = P(X_1 + X_2 \le 0)$. Using kernel $h(x_1, x_2) = I_{(-\infty,0]}(x_1 + x_2)$, we
obtain the following U-statistic unbiased for $\vartheta$:

$$U_n = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} I_{(-\infty,0]}(X_i + X_j),$$

which is known as the one-sample Wilcoxon statistic.
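
The three order-2 examples above can be checked numerically with a small sketch. The helper `u_statistic` and the data vector `xs` are illustrative names and values chosen here, not part of the text; the code simply averages a symmetric kernel over all $m$-subsets, as in (3.11).

```python
from itertools import combinations
from math import comb
import statistics


def u_statistic(xs, kernel, m):
    """Average of a symmetric kernel over all m-subsets of xs, as in (3.11)."""
    total = sum(kernel(*subset) for subset in combinations(xs, m))
    return total / comb(len(xs), m)


# Hypothetical data, used only for illustration.
xs = [1.5, -0.5, 2.0, -3.0, 0.25]

# Kernel (x1 - x2)^2 / 2 should reproduce the sample variance S^2.
var_u = u_statistic(xs, lambda a, b: (a - b) ** 2 / 2, 2)
s2 = statistics.variance(xs)

# Kernel |x1 - x2| gives Gini's mean difference.
gini = u_statistic(xs, lambda a, b: abs(a - b), 2)

# Indicator kernel I(x1 + x2 <= 0) gives the one-sample Wilcoxon statistic.
wilcoxon = u_statistic(xs, lambda a, b: 1.0 if a + b <= 0 else 0.0, 2)
```

Enumerating all $\binom{n}{m}$ subsets is only practical for small $n$ and $m$; the sketch is meant to mirror the definition, not to be efficient.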


Let $T_n = T_n(X_1, \dots, X_n)$ be a given statistic and let $r$ and $d$ be two
positive integers such that $r + d = n$. For any $s = \{i_1, \dots, i_r\} \subset \{1, \dots, n\}$,
define

$$T_{r,s} = T_r(X_{i_1}, \dots, X_{i_r}),$$

which is the statistic $T_n$ computed after $X_i$, $i \notin s$, are deleted from the
original sample. Let

$$U_n = \binom{n}{r}^{-1} \sum_c \frac{r}{d} (T_{r,s} - T_n)^2. \tag{3.13}$$

Then $U_n$ is a U-statistic with kernel

$$h_n(x_1, \dots, x_r) = \frac{r}{d} [T_r(x_1, \dots, x_r) - T_n(x_1, \dots, x_n)]^2.$$

Unlike the kernels in the previous examples, the kernel in this example
depends on $n$. The order of the kernel, $r$, may also depend on $n$. The
statistic $U_n$ in (3.13) is known as the delete-$d$ jackknife variance estimator
for $T_n$ (see, e.g., Shao and Tu (1995)), since it is often true that

$$E[h_n(X_1, \dots, X_r)] \approx \mathrm{Var}(T_n).$$

It can be shown that if $T_n = \bar{X}$, then $nU_n$ in (3.13) is exactly the same as
the sample variance $S^2$ (exercise).
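
A minimal sketch of the delete-$d$ jackknife variance estimator in (3.13) follows. The function name `delete_d_jackknife` and the sample values are assumptions made for illustration. For the sample mean, the result multiplied by $n$ should agree with $S^2$, as stated in the exercise above.

```python
from itertools import combinations
from math import comb
import statistics


def delete_d_jackknife(xs, stat, d):
    """Delete-d jackknife variance estimator (3.13) for the statistic `stat`."""
    n = len(xs)
    r = n - d
    t_n = stat(xs)
    # Sum (T_{r,s} - T_n)^2 over all subsets s of size r.
    total = sum((stat(s) - t_n) ** 2 for s in combinations(xs, r))
    return (r / d) * total / comb(n, r)


# Hypothetical data, used only for illustration.
xs = [2.0, -1.0, 4.5, 0.5, 3.0, -2.5]
u_n = delete_d_jackknife(xs, statistics.mean, d=2)
n_times_u = len(xs) * u_n
s2 = statistics.variance(xs)
```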




3.2.2 Variances of U-statistics 


If $E[h(X_1, \dots, X_m)]^2 < \infty$, then the variance of $U_n$ in (3.11) with kernel
$h$ has an explicit form. To derive $\mathrm{Var}(U_n)$, we need some notation. For
$k = 1, \dots, m$, let

$$h_k(x_1, \dots, x_k) = E[h(X_1, \dots, X_m) \mid X_1 = x_1, \dots, X_k = x_k] = E[h(x_1, \dots, x_k, X_{k+1}, \dots, X_m)].$$

Note that $h_m = h$. It can be shown that

$$h_k(x_1, \dots, x_k) = E[h_{k+1}(x_1, \dots, x_k, X_{k+1})]. \tag{3.14}$$

Define

$$\tilde{h}_k = h_k - E[h(X_1, \dots, X_m)], \tag{3.15}$$

$k = 1, \dots, m$, and $\tilde{h} = \tilde{h}_m$. Then, for any $U_n$ defined by (3.11),

$$U_n - E(U_n) = \binom{n}{m}^{-1} \sum_c \tilde{h}(X_{i_1}, \dots, X_{i_m}). \tag{3.16}$$


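Theorem 3.4 (Hoeffding's theorem). For a U-statistic $U_n$ given by (3.11)
with $E[h(X_1, \dots, X_m)]^2 < \infty$,

$$\mathrm{Var}(U_n) = \binom{n}{m}^{-1} \sum_{k=1}^m \binom{m}{k} \binom{n-m}{m-k} \zeta_k,$$

where

$$\zeta_k = \mathrm{Var}(h_k(X_1, \dots, X_k)).$$

Proof. Consider two sets $\{i_1, \dots, i_m\}$ and $\{j_1, \dots, j_m\}$ of $m$ distinct integers
from $\{1, \dots, n\}$ with exactly $k$ integers in common. The number of distinct
choices of two such sets is $\binom{n}{m}\binom{m}{k}\binom{n-m}{m-k}$. By the symmetry of $\tilde{h}_m$ and
independence of $X_1, \dots, X_n$,

$$E[\tilde{h}(X_{i_1}, \dots, X_{i_m}) \tilde{h}(X_{j_1}, \dots, X_{j_m})] = \zeta_k \tag{3.17}$$

for $k = 1, \dots, m$ (exercise). Then, by (3.16),

$$\mathrm{Var}(U_n) = \binom{n}{m}^{-2} \sum_c \sum_c E[\tilde{h}(X_{i_1}, \dots, X_{i_m}) \tilde{h}(X_{j_1}, \dots, X_{j_m})] = \binom{n}{m}^{-2} \sum_{k=1}^m \binom{n}{m} \binom{m}{k} \binom{n-m}{m-k} \zeta_k.$$

This proves the result. ∎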
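
The variance formula in Theorem 3.4 is easy to evaluate once the $\zeta_k$'s are known. The sketch below uses exact rational arithmetic; the function name `hoeffding_variance` and the numerical inputs are illustrative assumptions. As a check, for the product kernel $h(x_1, x_2) = x_1 x_2$ with $\mu = 1$ and $\sigma^2 = 2$ one has $\zeta_1 = \mu^2\sigma^2 = 2$ and $\zeta_2 = (\mu^2 + \sigma^2)^2 - \mu^4 = 8$, and the formula should agree with $4\mu^2\sigma^2/n + 2\sigma^4/[n(n-1)]$.

```python
from fractions import Fraction
from math import comb


def hoeffding_variance(n, zetas):
    """Var(U_n) from Theorem 3.4; zetas[k-1] holds zeta_k for k = 1, ..., m."""
    m = len(zetas)
    total = sum(
        comb(m, k) * comb(n - m, m - k) * zetas[k - 1] for k in range(1, m + 1)
    )
    return Fraction(total) / comb(n, m)


# Product kernel with mu = 1, sigma^2 = 2 (values chosen for illustration).
n = 10
var_un = hoeffding_variance(n, [2, 8])

# Closed form for this kernel: 4 mu^2 sigma^2 / n + 2 sigma^4 / (n (n - 1)).
closed_form = Fraction(4 * 1 * 2, n) + Fraction(2 * 4, n * (n - 1))
```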




Corollary 3.2. Under the condition of Theorem 3.4,

(i) $\frac{m^2}{n}\zeta_1 \le \mathrm{Var}(U_n) \le \frac{m}{n}\zeta_m$;

(ii) $(n+1)\mathrm{Var}(U_{n+1}) \le n\mathrm{Var}(U_n)$ for any $n > m$;

(iii) For any fixed $m$ and $k = 1, \dots, m$, if $\zeta_j = 0$ for $j < k$ and $\zeta_k > 0$, then

$$\mathrm{Var}(U_n) = \frac{k! \binom{m}{k}^2 \zeta_k}{n^k} + O\left(\frac{1}{n^{k+1}}\right). \; ∎$$

It follows from Corollary 3.2 that a U-statistic $U_n$ as an estimator of its
mean is consistent in mse (under the finite second moment assumption on
$h$). In fact, for any fixed $m$, if $\zeta_j = 0$ for $j < k$ and $\zeta_k > 0$, then the mse of
$U_n$ is of the order $n^{-k}$ and, therefore, $U_n$ is $n^{k/2}$-consistent.


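Example 3.11. Consider first $h(x_1, x_2) = x_1 x_2$, which leads to a U-statistic
unbiased for $\mu^2$, $\mu = EX_1$. Note that $h_1(x_1) = \mu x_1$, $\tilde{h}_1(x_1) =
\mu(x_1 - \mu)$, $\zeta_1 = E[\tilde{h}_1(X_1)]^2 = \mu^2\mathrm{Var}(X_1) = \mu^2\sigma^2$, $\tilde{h}(x_1, x_2) = x_1 x_2 - \mu^2$,
and $\zeta_2 = \mathrm{Var}(X_1 X_2) = E(X_1 X_2)^2 - \mu^4 = (\mu^2 + \sigma^2)^2 - \mu^4$. By Theorem
3.4, for $U_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} X_i X_j$,

$$\begin{aligned}
\mathrm{Var}(U_n) &= \binom{n}{2}^{-1} \left[ \binom{2}{1}\binom{n-2}{1}\zeta_1 + \binom{2}{2}\binom{n-2}{0}\zeta_2 \right] \\
&= \frac{2}{n(n-1)} \left[ 2(n-2)\mu^2\sigma^2 + (\mu^2 + \sigma^2)^2 - \mu^4 \right] \\
&= \frac{4\mu^2\sigma^2}{n} + \frac{2\sigma^4}{n(n-1)}.
\end{aligned}$$

Comparing $U_n$ with $\bar{X}^2 - \sigma^2/n$ in Example 3.10, which is the UMVUE
under the normality and known $\sigma^2$ assumption, we find that

$$\mathrm{Var}(U_n) - \mathrm{Var}(\bar{X}^2 - \sigma^2/n) = \frac{2\sigma^4}{n^2(n-1)}.$$

Next, consider $h(x_1, x_2) = I_{(-\infty,0]}(x_1 + x_2)$, which leads to the one-sample
Wilcoxon statistic. Note that $h_1(x_1) = P(x_1 + X_2 \le 0) = F(-x_1)$,
where $F$ is the c.d.f. of $P$. Then $\zeta_1 = \mathrm{Var}(F(-X_1))$. Let $\vartheta = E[h(X_1, X_2)]$.
Then $\zeta_2 = \mathrm{Var}(h(X_1, X_2)) = \vartheta(1 - \vartheta)$. Hence, for $U_n$ being the one-sample
Wilcoxon statistic,

$$\mathrm{Var}(U_n) = \frac{2}{n(n-1)} \left[ 2(n-2)\zeta_1 + \vartheta(1 - \vartheta) \right].$$

If $F$ is continuous and symmetric about 0, then $\zeta_1$ can be simplified as

$$\zeta_1 = \mathrm{Var}(F(-X_1)) = \mathrm{Var}(1 - F(X_1)) = \mathrm{Var}(F(X_1)) = \frac{1}{12},$$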




since $F(X_1)$ has the uniform distribution on $[0, 1]$.

Finally, consider $h(x_1, x_2) = |x_1 - x_2|$, which leads to Gini's mean difference.
Note that

$$h_1(x_1) = E|x_1 - X_2| = \int |x_1 - y| \, dP(y)$$

and

$$\zeta_1 = \mathrm{Var}(h_1(X_1)) = \int \left[ \int |x - y| \, dP(y) \right]^2 dP(x) - \vartheta^2,$$

where $\vartheta = E|X_1 - X_2|$. ∎


3.2.3 The projection method 


Since P is nonparametric, the exact distribution of any U-statistic is hard 
to derive. In this section, we study asymptotic distributions of U-statistics 
by using the method of projection. 


Definition 3.3. Let $T_n$ be a given statistic based on $X_1, \dots, X_n$. The
projection of $T_n$ on $k_n$ random elements $Y_1, \dots, Y_{k_n}$ is defined to be

$$\check{T}_n = E(T_n) + \sum_{i=1}^{k_n} [E(T_n \mid Y_i) - E(T_n)]. \; ∎$$

Let $\psi_n(X_i) = E(T_n \mid X_i)$. If $T_n$ is symmetric (as a function of $X_1, \dots, X_n$),
then $\psi_n(X_1), \dots, \psi_n(X_n)$ are i.i.d. with mean $E[\psi_n(X_i)] = E[E(T_n \mid X_i)] =
E(T_n)$. If $E(T_n^2) < \infty$ and $\mathrm{Var}(\psi_n(X_i)) > 0$, then

$$\frac{1}{\sqrt{n\mathrm{Var}(\psi_n(X_1))}} \sum_{i=1}^n [\psi_n(X_i) - E(T_n)] \to_d N(0, 1) \tag{3.18}$$

by the CLT. Let $\check{T}_n$ be the projection of $T_n$ on $X_1, \dots, X_n$. Then

$$T_n - \check{T}_n = T_n - E(T_n) - \sum_{i=1}^n [\psi_n(X_i) - E(T_n)]. \tag{3.19}$$

If we can show that $T_n - \check{T}_n$ has a negligible order of magnitude, then
we can derive the asymptotic distribution of $T_n$ by using (3.18)-(3.19) and
Slutsky's theorem. The order of magnitude of $T_n - \check{T}_n$ can be obtained with
the help of the following lemma.

Lemma 3.1. Let $T_n$ be a symmetric statistic with $\mathrm{Var}(T_n) < \infty$ for every
$n$ and $\check{T}_n$ be the projection of $T_n$ on $X_1, \dots, X_n$. Then $E(T_n) = E(\check{T}_n)$ and

$$E(T_n - \check{T}_n)^2 = \mathrm{Var}(T_n) - \mathrm{Var}(\check{T}_n).$$




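Proof. Since $E(T_n) = E(\check{T}_n)$,

$$E(T_n - \check{T}_n)^2 = \mathrm{Var}(T_n) + \mathrm{Var}(\check{T}_n) - 2\mathrm{Cov}(T_n, \check{T}_n).$$

From Definition 3.3 with $Y_i = X_i$ and $k_n = n$,

$$\mathrm{Var}(\check{T}_n) = n\mathrm{Var}(E(T_n \mid X_i)).$$

The result follows from

$$\begin{aligned}
\mathrm{Cov}(T_n, \check{T}_n) &= E(T_n \check{T}_n) - [E(T_n)]^2 \\
&= nE[T_n E(T_n \mid X_i)] - n[E(T_n)]^2 \\
&= nE\{E[T_n E(T_n \mid X_i) \mid X_i]\} - n[E(T_n)]^2 \\
&= nE\{[E(T_n \mid X_i)]^2\} - n[E(T_n)]^2 \\
&= n\mathrm{Var}(E(T_n \mid X_i)) \\
&= \mathrm{Var}(\check{T}_n). \; ∎
\end{aligned}$$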


This method of deriving the asymptotic distribution of $T_n$ is known as
the method of projection and is particularly effective for U-statistics. For
a U-statistic $U_n$ given by (3.11), one can show (exercise) that

$$\check{U}_n = E(U_n) + \frac{m}{n} \sum_{i=1}^n \tilde{h}_1(X_i), \tag{3.20}$$

where $\check{U}_n$ is the projection of $U_n$ on $X_1, \dots, X_n$ and $\tilde{h}_1$ is defined by (3.15).
Hence

$$\mathrm{Var}(\check{U}_n) = m^2\zeta_1/n$$

and, by Corollary 3.2 and Lemma 3.1,

$$E(U_n - \check{U}_n)^2 = O(n^{-2}).$$

If $\zeta_1 > 0$, then (3.18) holds with $\psi_n(X_i) = m h_1(X_i)$, which leads to the
result in Theorem 3.5(i) stated later.

If $\zeta_1 = 0$, then $\tilde{h}_1 \equiv 0$ and we have to use another projection of $U_n$.
Suppose that $\zeta_1 = \cdots = \zeta_{k-1} = 0$ and $\zeta_k > 0$ for an integer $k > 1$.
Consider the projection $\check{U}_{kn}$ of $U_n$ on $\binom{n}{k}$ random vectors $\{X_{i_1}, \dots, X_{i_k}\}$,
$1 \le i_1 < \cdots < i_k \le n$. We can establish a result similar to that in Lemma
3.1 (exercise) and show that

$$E(U_n - \check{U}_{kn})^2 = O(n^{-(k+1)}).$$

Also, see Serfling (1980, §5.3.4).
With these results, we obtain the following theorem. 




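Theorem 3.5. Let $U_n$ be given by (3.11) with $E[h(X_1, \dots, X_m)]^2 < \infty$.
(i) If $\zeta_1 > 0$, then

$$\sqrt{n}[U_n - E(U_n)] \to_d N(0, m^2\zeta_1).$$

(ii) If $\zeta_1 = 0$ but $\zeta_2 > 0$, then

$$n[U_n - E(U_n)] \to_d \frac{m(m-1)}{2} \sum_{j=1}^\infty \lambda_j(\chi^2_{1j} - 1), \tag{3.21}$$

where $\chi^2_{1j}$'s are i.i.d. random variables having the chi-square distribution $\chi^2_1$
and $\lambda_j$'s are some constants (which may depend on $P$) satisfying $\sum_{j=1}^\infty \lambda_j^2 =
\zeta_2$. ∎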


We have actually proved Theorem 3.5(i). A proof for Theorem 3.5(ii) is
given in Serfling (1980, §5.5.2). One may derive results for the cases where
$\zeta_2 = 0$, but the case of either $\zeta_1 > 0$ or $\zeta_2 > 0$ is the most interesting case
in applications.

If $\zeta_1 > 0$, it follows from Theorem 3.5(i) and Corollary 3.2(iii) that
$\mathrm{amse}_{U_n}(P) = m^2\zeta_1/n = \mathrm{Var}(U_n) + O(n^{-2})$. By Proposition 2.4(ii),
$\{n[U_n - E(U_n)]^2\}$ is uniformly integrable.

If $\zeta_1 = 0$ but $\zeta_2 > 0$, it follows from Theorem 3.5(ii) that $\mathrm{amse}_{U_n}(P) =
EY^2/n^2$, where $Y$ denotes the random variable on the right-hand side of
(3.21). The following result provides the value of $EY^2$.


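Lemma 3.2. Let $Y$ be the random variable on the right-hand side of
(3.21). Then $EY^2 = \frac{m^2(m-1)^2}{2}\zeta_2$.

Proof. Define

$$Y_k = \frac{m(m-1)}{2} \sum_{j=1}^k \lambda_j(\chi^2_{1j} - 1), \quad k = 1, 2, \dots$$

It can be shown (exercise) that $\{Y_k^2\}$ is uniformly integrable. Since $Y_k \to_d Y$
as $k \to \infty$, $\lim_{k\to\infty} EY_k^2 = EY^2$ (Theorem 1.8(viii)). Since $\chi^2_{1j}$'s are
independent chi-square random variables with $E\chi^2_{1j} = 1$ and $\mathrm{Var}(\chi^2_{1j}) = 2$,
$EY_k = 0$ for any $k$ and

$$EY_k^2 = \frac{m^2(m-1)^2}{4} \sum_{j=1}^k \lambda_j^2 \mathrm{Var}(\chi^2_{1j}) = \frac{m^2(m-1)^2}{4} \left( 2\sum_{j=1}^k \lambda_j^2 \right) \to \frac{m^2(m-1)^2}{2}\zeta_2. \; ∎$$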



It follows from Corollary 3.2(iii) and Lemma 3.2 that $\mathrm{amse}_{U_n}(P) =
\frac{m^2(m-1)^2}{2}\zeta_2/n^2 = \mathrm{Var}(U_n) + O(n^{-3})$ if $\zeta_1 = 0$. Again, by Proposition
2.4(ii), the sequence $\{n^2[U_n - E(U_n)]^2\}$ is uniformly integrable.

We now apply Theorem 3.5 to the U-statistics in Example 3.11. For
$U_n = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} X_i X_j$, $\zeta_1 = \mu^2\sigma^2$. Thus, if $\mu \ne 0$, the result in
Theorem 3.5(i) holds with $\zeta_1 = \mu^2\sigma^2$. If $\mu = 0$, then $\zeta_1 = 0$, $\zeta_2 = \sigma^4 > 0$,
and Theorem 3.5(ii) applies. However, it is not convenient to use Theorem
3.5(ii) to find the limiting distribution of $U_n$. We may derive this limiting
distribution using the following technique, which is further discussed in
§3.5. By the CLT and Theorem 1.10,

$$n\bar{X}^2/\sigma^2 \to_d \chi^2_1$$

when $\mu = 0$, where $\chi^2_1$ is a random variable having the chi-square distribution
$\chi^2_1$. Note that

$$\frac{n\bar{X}^2}{\sigma^2} = \frac{1}{\sigma^2 n} \sum_{i=1}^n X_i^2 + \frac{(n-1)U_n}{\sigma^2}.$$

By the SLLN, $\frac{1}{\sigma^2 n} \sum_{i=1}^n X_i^2 \to_{a.s.} 1$. An application of Slutsky's theorem
leads to

$$nU_n/\sigma^2 \to_d \chi^2_1 - 1.$$

Since $\mu = 0$, this implies that the right-hand side of (3.21) is $\sigma^2(\chi^2_1 - 1)$,
i.e., $\lambda_1 = \sigma^2$ and $\lambda_j = 0$ when $j > 1$.
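
The algebraic identity used in this argument, $n\bar{X}^2 = n^{-1}\sum_{i=1}^n X_i^2 + (n-1)U_n$ for the product-kernel U-statistic, can be verified exactly on any data set. The sketch below uses rational arithmetic; the data values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical data, used only for illustration.
xs = [Fraction(v) for v in (3, -1, 4, -2, 5)]
n = len(xs)

xbar = sum(xs) / n
# U_n with kernel h(x1, x2) = x1 * x2.
u_n = Fraction(2, n * (n - 1)) * sum(a * b for a, b in combinations(xs, 2))

lhs = n * xbar ** 2
rhs = sum(x * x for x in xs) / n + (n - 1) * u_n
```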

For the one-sample Wilcoxon statistic, $\zeta_1 = \mathrm{Var}(F(-X_1)) > 0$ unless
$F$ is degenerate. Similarly, for Gini's mean difference, $\zeta_1 > 0$ unless $F$ is
degenerate. Hence Theorem 3.5(i) applies to these two cases.

Theorem 3.5 does not apply to $U_n$ defined by (3.13) if $r$, the order of
the kernel, depends on $n$ and diverges to $\infty$ as $n \to \infty$. We consider the
simple case where

$$T_n = \frac{1}{n} \sum_{i=1}^n \psi(X_i) + R_n \tag{3.22}$$

for some $R_n$ satisfying $E(R_n^2) = o(n^{-1})$. Note that (3.22) is satisfied for
$T_n$ being a U-statistic (exercise). Assume that $r/d$ is bounded. Let $S_\psi^2 =
(n-1)^{-1} \sum_{i=1}^n [\psi(X_i) - n^{-1}\sum_{i=1}^n \psi(X_i)]^2$. Then

$$nU_n = S_\psi^2 + o_p(1) \tag{3.23}$$

(exercise). Under (3.22), if $0 < E[\psi(X_i)]^2 < \infty$, then $\mathrm{amse}_{T_n}(P) =
E[\psi(X_i)]^2/n$. Hence, the jackknife estimator $U_n$ in (3.13) provides a consistent
estimator of $\mathrm{amse}_{T_n}(P)$, i.e., $U_n/\mathrm{amse}_{T_n}(P) \to_p 1$.




3.3 The LSE in Linear Models 


One of the most useful statistical models for non-i.i.d. data in applications
is the general linear model

$$X_i = \beta^\tau Z_i + \varepsilon_i, \quad i = 1, \dots, n, \tag{3.24}$$

where $X_i$ is the $i$th observation and is often called the $i$th response; $\beta$
is a $p$-vector of unknown parameters, $p < n$; $Z_i$ is the $i$th value of a $p$-vector
of explanatory variables (or covariates); and $\varepsilon_1, \dots, \varepsilon_n$ are random
errors. Our data in this case are $(X_1, Z_1), \dots, (X_n, Z_n)$ ($\varepsilon_i$'s are not observed).
Throughout this book $Z_i$'s are considered to be nonrandom or
given values of a random $p$-vector, in which case our analysis is conditioned
on $Z_1, \dots, Z_n$. Each $\varepsilon_i$ can be viewed as a random measurement error in
measuring the unknown mean of $X_i$ when the covariate vector is equal to
$Z_i$. The main parameter of interest is $\beta$. More specific examples of model
(3.24) are provided in this section. Other examples and examples of data
from model (3.24) can be found in many standard books for linear models,
for example, Draper and Smith (1981) and Searle (1971).


3.3.1 The LSE and estimability 


Let $X = (X_1, \dots, X_n)$, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$, and $Z$ be the $n \times p$ matrix whose $i$th
row is the vector $Z_i$, $i = 1, \dots, n$. Then, a matrix form of model (3.24) is

$$X = Z\beta + \varepsilon. \tag{3.25}$$

Definition 3.4. Suppose that the range of $\beta$ in model (3.25) is $B \subset R^p$.
A least squares estimator (LSE) of $\beta$ is defined to be any $\hat\beta \in B$ such that

$$\|X - Z\hat\beta\|^2 = \min_{b \in B} \|X - Zb\|^2. \tag{3.26}$$

For any $l \in R^p$, $l^\tau\hat\beta$ is called an LSE of $l^\tau\beta$. ∎

Throughout this book, we consider $B = R^p$ unless otherwise stated.
Differentiating $\|X - Zb\|^2$ w.r.t. $b$, we obtain that any solution of

$$Z^\tau Z b = Z^\tau X \tag{3.27}$$

is an LSE of $\beta$. If the rank of the matrix $Z$ is $p$, in which case $(Z^\tau Z)^{-1}$
exists and $Z$ is said to be of full rank, then there is a unique LSE, which is

$$\hat\beta = (Z^\tau Z)^{-1} Z^\tau X. \tag{3.28}$$




If $Z$ is not of full rank, then there are infinitely many LSE's of $\beta$. It can
be shown (exercise) that any LSE of $\beta$ is of the form

$$\hat\beta = (Z^\tau Z)^- Z^\tau X, \tag{3.29}$$

where $(Z^\tau Z)^-$ is called a generalized inverse of $Z^\tau Z$ and satisfies

$$Z^\tau Z (Z^\tau Z)^- Z^\tau Z = Z^\tau Z.$$

Generalized inverse matrices are not unique unless $Z$ is of full rank, in which
case $(Z^\tau Z)^- = (Z^\tau Z)^{-1}$ and (3.29) reduces to (3.28).

To study properties of LSE's of $\beta$, we need some assumptions on the
distribution of $X$. Since $Z_i$'s are nonrandom, assumptions on the distribution
of $X$ can be expressed in terms of assumptions on the distribution of
$\varepsilon$. Several commonly adopted assumptions are stated as follows.

Assumption A1: $\varepsilon$ is distributed as $N_n(0, \sigma^2 I_n)$ with an unknown $\sigma^2 > 0$.
Assumption A2: $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$ with an unknown $\sigma^2 > 0$.
Assumption A3: $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon)$ is an unknown matrix.

Assumption A1 is the strongest and implies a parametric model. We
may assume a slightly more general assumption that $\varepsilon$ has the $N_n(0, \sigma^2 D)$
distribution with unknown $\sigma^2$ but a known positive definite matrix $D$. Let
$D^{-1/2}$ be the inverse of the square root matrix of $D$. Then model (3.25)
with assumption A1 holds if we replace $X$, $Z$, and $\varepsilon$ by the transformed
variables $\tilde{X} = D^{-1/2}X$, $\tilde{Z} = D^{-1/2}Z$, and $\tilde\varepsilon = D^{-1/2}\varepsilon$, respectively. A
similar conclusion can be made for assumption A2.

Under assumption A1, the distribution of $X$ is $N_n(Z\beta, \sigma^2 I_n)$, which
is in an exponential family $\mathcal{P}$ with parameter $\theta = (\beta, \sigma^2) \in R^p \times (0, \infty)$.
However, if the matrix $Z$ is not of full rank, then $\mathcal{P}$ is not identifiable (see
§2.1.2), since $Z\beta_1 = Z\beta_2$ does not imply $\beta_1 = \beta_2$.

Suppose that the rank of $Z$ is $r < p$. Then there is an $n \times r$ submatrix
$Z_*$ of $Z$ such that

$$Z = Z_* Q \tag{3.30}$$

and $Z_*$ is of rank $r$, where $Q$ is a fixed $r \times p$ matrix. Then

$$Z\beta = Z_* Q\beta$$

and $\mathcal{P}$ is identifiable if we consider the reparameterization $\tilde\beta = Q\beta$. Note
that the new parameter $\tilde\beta$ is in a subspace of $R^p$ with dimension $r$.

In many applications, we are interested in estimating some linear functions
of $\beta$, i.e., $\vartheta = l^\tau\beta$ for some $l \in R^p$. From the previous discussion,
however, estimation of $l^\tau\beta$ is meaningless unless $l = Q^\tau c$ for some $c \in R^r$
so that

$$l^\tau\beta = c^\tau Q\beta = c^\tau\tilde\beta.$$




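The following result shows that $l^\tau\beta$ is estimable if $l = Q^\tau c$, which is also
necessary for $l^\tau\beta$ to be estimable under assumption A1.

Theorem 3.6. Assume model (3.25) with assumption A3.

(i) A necessary and sufficient condition for $l \in R^p$ being $Q^\tau c$ for some
$c \in R^r$ is $l \in \mathcal{R}(Z) = \mathcal{R}(Z^\tau Z)$, where $Q$ is given by (3.30) and $\mathcal{R}(A)$ is the
smallest linear subspace containing all rows of $A$.

(ii) If $l \in \mathcal{R}(Z)$, then the LSE $l^\tau\hat\beta$ is unique and unbiased for $l^\tau\beta$.

(iii) If $l \notin \mathcal{R}(Z)$ and assumption A1 holds, then $l^\tau\beta$ is not estimable.

Proof. (i) Note that $a \in \mathcal{R}(A)$ if and only if $a = A^\tau b$ for some vector $b$. If
$l = Q^\tau c$, then

$$l = Q^\tau c = Q^\tau Z_*^\tau Z_* (Z_*^\tau Z_*)^{-1} c = Z^\tau [Z_* (Z_*^\tau Z_*)^{-1} c].$$

Hence $l \in \mathcal{R}(Z)$. If $l \in \mathcal{R}(Z)$, then $l = Z^\tau\zeta$ for some $\zeta$ and

$$l = (Z_* Q)^\tau\zeta = Q^\tau c$$

with $c = Z_*^\tau\zeta$.

(ii) If $l \in \mathcal{R}(Z) = \mathcal{R}(Z^\tau Z)$, then $l = Z^\tau Z\zeta$ for some $\zeta$ and by (3.29),

$$\begin{aligned}
E(l^\tau\hat\beta) &= E[l^\tau (Z^\tau Z)^- Z^\tau X] \\
&= \zeta^\tau Z^\tau Z (Z^\tau Z)^- Z^\tau Z\beta \\
&= \zeta^\tau Z^\tau Z\beta \\
&= l^\tau\beta.
\end{aligned}$$

If $\bar\beta$ is any other LSE of $\beta$, then, by (3.27),

$$l^\tau\hat\beta - l^\tau\bar\beta = \zeta^\tau (Z^\tau Z)(\hat\beta - \bar\beta) = \zeta^\tau (Z^\tau X - Z^\tau X) = 0.$$

(iii) Under assumption A1, if there is an estimator $h(X, Z)$ unbiased for
$l^\tau\beta$, then

$$l^\tau\beta = \int_{R^n} h(x, Z)(2\pi)^{-n/2}\sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2}\|x - Z\beta\|^2 \right\} dx.$$

Differentiating w.r.t. $\beta$ and applying Theorem 2.1 lead to

$$l = Z^\tau \int_{R^n} h(x, Z)(2\pi)^{-n/2}\sigma^{-n-2}(x - Z\beta) \exp\left\{ -\frac{1}{2\sigma^2}\|x - Z\beta\|^2 \right\} dx,$$

which implies $l \in \mathcal{R}(Z)$. ∎

Theorem 3.6 shows that LSE's are unbiased for estimable parameters
$l^\tau\beta$. If $Z$ is of full rank, then $\mathcal{R}(Z) = R^p$ and, therefore, $l^\tau\beta$ is estimable
for any $l \in R^p$.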




Example 3.12 (Simple linear regression). Let $\beta = (\beta_0, \beta_1) \in R^2$ and
$Z_i = (1, t_i)$, $t_i \in R$, $i = 1, \dots, n$. Then model (3.24) or (3.25) is called a
simple linear regression model. It turns out that

$$Z^\tau Z = \begin{pmatrix} n & \sum_{i=1}^n t_i \\ \sum_{i=1}^n t_i & \sum_{i=1}^n t_i^2 \end{pmatrix}.$$

This matrix is invertible if and only if some $t_i$'s are different. Thus, if some
$t_i$'s are different, then the unique unbiased LSE of $l^\tau\beta$ for any $l \in R^2$ is
$l^\tau (Z^\tau Z)^{-1} Z^\tau X$, which has the normal distribution if assumption A1 holds.

The result can be easily extended to the case of polynomial regression
of order $p$ in which $\beta = (\beta_0, \beta_1, \dots, \beta_{p-1})$ and $Z_i = (1, t_i, \dots, t_i^{p-1})$. ∎
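
For simple linear regression the normal equations (3.27) are a $2 \times 2$ system and can be solved in closed form. The following is a minimal sketch under that assumption; the function name `simple_lse` and the data are hypothetical and are not taken from the text.

```python
def simple_lse(ts, xs):
    """Solve Z^T Z b = Z^T X for Z_i = (1, t_i); returns (b0, b1)."""
    n = len(ts)
    st = sum(ts)
    stt = sum(t * t for t in ts)
    sx = sum(xs)
    stx = sum(t * x for t, x in zip(ts, xs))
    # Determinant of Z^T Z; zero exactly when all t_i are equal.
    det = n * stt - st * st
    if det == 0:
        raise ValueError("Z is not of full rank: all t_i are equal")
    b0 = (stt * sx - st * stx) / det
    b1 = (n * stx - st * sx) / det
    return b0, b1


# Hypothetical noiseless data on the line x = 2 + 3t.
b0, b1 = simple_lse([0, 1, 2, 3], [2, 5, 8, 11])
```

When all $t_i$'s coincide, the determinant vanishes and the sketch raises an error instead of choosing one of the infinitely many LSE's.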


Example 3.13 (One-way ANOVA). Suppose that $n = \sum_{j=1}^m n_j$ with $m$
positive integers $n_1, \dots, n_m$ and that

$$X_i = \mu_j + \varepsilon_i, \quad i = k_{j-1} + 1, \dots, k_j, \; j = 1, \dots, m,$$

where $k_0 = 0$, $k_j = \sum_{i=1}^j n_i$, $j = 1, \dots, m$, and $(\mu_1, \dots, \mu_m) = \beta$. Let $J_m$ be
the $m$-vector of ones. Then the matrix $Z$ in this case is a block diagonal
matrix with $J_{n_j}$ as the $j$th diagonal column. Consequently, $Z^\tau Z$ is an
$m \times m$ diagonal matrix whose $j$th diagonal element is $n_j$. Thus, $Z^\tau Z$ is
invertible and the unique LSE of $\beta$ is the $m$-vector whose $j$th component
is $n_j^{-1} \sum_{i=k_{j-1}+1}^{k_j} X_i$, $j = 1, \dots, m$.

Sometimes it is more convenient to use the following notation:

$$X_{ij} = X_{k_{i-1}+j}, \quad \varepsilon_{ij} = \varepsilon_{k_{i-1}+j}, \quad j = 1, \dots, n_i, \; i = 1, \dots, m,$$

and

$$\mu_i = \mu + \alpha_i, \quad i = 1, \dots, m.$$

Then our model becomes

$$X_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad j = 1, \dots, n_i, \; i = 1, \dots, m, \tag{3.31}$$

which is called a one-way analysis of variance (ANOVA) model. Under
model (3.31), $\beta = (\mu, \alpha_1, \dots, \alpha_m) \in R^{m+1}$. The matrix $Z$ under model
(3.31) is not of full rank (exercise). An LSE of $\beta$ under model (3.31) is

$$\hat\beta = (\bar{X}, \bar{X}_{1\cdot} - \bar{X}, \dots, \bar{X}_{m\cdot} - \bar{X}),$$

where $\bar{X}$ is still the sample mean of $X_{ij}$'s and $\bar{X}_{i\cdot}$ is the sample mean of the
$i$th group $\{X_{ij}, j = 1, \dots, n_i\}$. The problem of finding the form of $l \in \mathcal{R}(Z)$
under model (3.31) is left as an exercise. ∎
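
The non-uniqueness of LSE's under model (3.31) can be illustrated with a small exact computation. The sketch below builds the design rows $(1, I\{i=1\}, \dots, I\{i=m\})$ for hypothetical data and checks that two different vectors both solve the normal equations (3.27) while giving the same values of the estimable quantities $\mu + \alpha_i$. All names (`groups`, `normal_eq_residual`, `lse_a`, `lse_b`) are illustrative.

```python
from fractions import Fraction

# Hypothetical data: m = 3 groups of unequal sizes.
groups = [[Fraction(v) for v in g] for g in ([1, 3], [2, 4, 6], [10])]
m = len(groups)

# Rows of Z under model (3.31): intercept plus group indicators.
rows, resp = [], []
for i, g in enumerate(groups):
    for x in g:
        rows.append([1] + [1 if j == i else 0 for j in range(m)])
        resp.append(x)


def normal_eq_residual(b):
    """Components of Z^T (Z b - X); all zero iff b solves (3.27)."""
    fitted = [sum(rk * bk for rk, bk in zip(row, b)) for row in rows]
    return [
        sum(row[c] * (f - x) for row, f, x in zip(rows, fitted, resp))
        for c in range(len(b))
    ]


group_means = [sum(g) / len(g) for g in groups]
grand_mean = sum(resp) / len(resp)

# The LSE given in the text, and a second solution with mu set to 0.
lse_a = [grand_mean] + [gm - grand_mean for gm in group_means]
lse_b = [Fraction(0)] + group_means

# Estimable quantities mu + alpha_i under each solution.
est_a = [lse_a[0] + a for a in lse_a[1:]]
est_b = [lse_b[0] + a for a in lse_b[1:]]
```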


The notation used in model (3.31) allows us to generalize the one-way 
ANOVA model to any s-way ANOVA model with a positive integer s under 




the so-called factorial experiments. The following example is for the two- 
way ANOVA model. 


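Example 3.14 (Two-way balanced ANOVA). Suppose that

$$X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \quad i = 1, \dots, a, \; j = 1, \dots, b, \; k = 1, \dots, c, \tag{3.32}$$

where $a$, $b$, and $c$ are some positive integers. Model (3.32) is called a two-way
balanced ANOVA model. If we view model (3.32) as a special case of
model (3.25), then the parameter vector $\beta$ is

$$\beta = (\mu, \alpha_1, \dots, \alpha_a, \beta_1, \dots, \beta_b, \gamma_{11}, \dots, \gamma_{1b}, \dots, \gamma_{a1}, \dots, \gamma_{ab}). \tag{3.33}$$

One can obtain the matrix $Z$ and show that it is $n \times p$, where $n = abc$ and
$p = 1 + a + b + ab$, and is of rank $ab < p$ (exercise). It can also be shown
(exercise) that an LSE of $\beta$ is given by the right-hand side of (3.33) with $\mu$,
$\alpha_i$, $\beta_j$, and $\gamma_{ij}$ replaced by $\hat\mu$, $\hat\alpha_i$, $\hat\beta_j$, and $\hat\gamma_{ij}$, respectively, where $\hat\mu = \bar{X}_{\cdot\cdot\cdot}$,
$\hat\alpha_i = \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot}$, $\hat\beta_j = \bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot}$, $\hat\gamma_{ij} = \bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot}$, and a dot
is used to denote averaging over the indicated subscript, e.g.,

$$\bar{X}_{\cdot j\cdot} = \frac{1}{ac} \sum_{i=1}^a \sum_{k=1}^c X_{ijk}$$

with a fixed $j$. ∎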


3.3.2 The UMVUE and BLUE 


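We now study UMVUE's in model (3.25) with assumption A1.

Theorem 3.7. Consider model (3.25) with assumption A1.
(i) The LSE $l^\tau\hat\beta$ is the UMVUE of $l^\tau\beta$ for any estimable $l^\tau\beta$.
(ii) The UMVUE of $\sigma^2$ is $\hat\sigma^2 = (n - r)^{-1}\|X - Z\hat\beta\|^2$, where $r$ is the rank
of $Z$.
Proof. (i) Let $\hat\beta$ be an LSE of $\beta$. By (3.27),

$$(X - Z\hat\beta)^\tau Z(\hat\beta - \beta) = (X^\tau Z - X^\tau Z)(\hat\beta - \beta) = 0$$

and, hence,

$$\begin{aligned}
\|X - Z\beta\|^2 &= \|X - Z\hat\beta + Z\hat\beta - Z\beta\|^2 \\
&= \|X - Z\hat\beta\|^2 + \|Z\hat\beta - Z\beta\|^2 \\
&= \|X - Z\hat\beta\|^2 - 2\beta^\tau Z^\tau X + \|Z\beta\|^2 + \|Z\hat\beta\|^2.
\end{aligned}$$

Using this result and assumption A1, we obtain the following joint Lebesgue
p.d.f. of $X$:

$$(2\pi\sigma^2)^{-n/2} \exp\left\{ \frac{\beta^\tau Z^\tau x}{\sigma^2} - \frac{\|x - Z\hat\beta\|^2 + \|Z\hat\beta\|^2}{2\sigma^2} - \frac{\|Z\beta\|^2}{2\sigma^2} \right\}.$$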




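By Proposition 2.1 and the fact that $Z\hat\beta = Z(Z^\tau Z)^- Z^\tau X$ is a function of
$Z^\tau X$, $(Z^\tau X, \|X - Z\hat\beta\|^2)$ is complete and sufficient for $\theta = (\beta, \sigma^2)$. Note
that $\hat\beta$ is a function of $Z^\tau X$ and, hence, a function of the complete sufficient
statistic. If $l^\tau\beta$ is estimable, then $l^\tau\hat\beta$ is unbiased for $l^\tau\beta$ (Theorem 3.6)
and, hence, $l^\tau\hat\beta$ is the UMVUE of $l^\tau\beta$.

(ii) From $\|X - Z\beta\|^2 = \|X - Z\hat\beta\|^2 + \|Z\hat\beta - Z\beta\|^2$ and $E(Z\hat\beta) = Z\beta$
(Theorem 3.6),

$$\begin{aligned}
E\|X - Z\hat\beta\|^2 &= E(X - Z\beta)^\tau (X - Z\beta) - E(\hat\beta - \beta)^\tau Z^\tau Z(\hat\beta - \beta) \\
&= \mathrm{tr}\left( \mathrm{Var}(X) - \mathrm{Var}(Z\hat\beta) \right) \\
&= \sigma^2 [n - \mathrm{tr}(Z(Z^\tau Z)^- Z^\tau Z(Z^\tau Z)^- Z^\tau)] \\
&= \sigma^2 [n - \mathrm{tr}((Z^\tau Z)^- Z^\tau Z)].
\end{aligned}$$

Since each row of $Z \in \mathcal{R}(Z)$, $Z\hat\beta$ does not depend on the choice of $(Z^\tau Z)^-$ in
$\hat\beta = (Z^\tau Z)^- Z^\tau X$ (Theorem 3.6). Hence, we can evaluate $\mathrm{tr}((Z^\tau Z)^- Z^\tau Z)$
using a particular $(Z^\tau Z)^-$. From the theory of linear algebra, there exists
a $p \times p$ matrix $C$ such that $CC^\tau = I_p$ and

$$C^\tau (Z^\tau Z) C = \begin{pmatrix} \Lambda & 0 \\ 0 & 0 \end{pmatrix},$$

where $\Lambda$ is an $r \times r$ diagonal matrix whose diagonal elements are positive.
Then, a particular choice of $(Z^\tau Z)^-$ is

$$(Z^\tau Z)^- = C \begin{pmatrix} \Lambda^{-1} & 0 \\ 0 & 0 \end{pmatrix} C^\tau \tag{3.34}$$

and

$$(Z^\tau Z)^- Z^\tau Z = C \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix} C^\tau,$$

whose trace is $r$. Hence $\hat\sigma^2$ is the UMVUE of $\sigma^2$, since it is a function of
the complete sufficient statistic and

$$E\hat\sigma^2 = (n - r)^{-1} E\|X - Z\hat\beta\|^2 = \sigma^2. \; ∎$$

In general,

$$\mathrm{Var}(l^\tau\hat\beta) = l^\tau (Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) Z (Z^\tau Z)^- l. \tag{3.35}$$

If $l \in \mathcal{R}(Z)$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$ (assumption A2), then the use of the generalized
inverse matrix in (3.34) leads to $\mathrm{Var}(l^\tau\hat\beta) = \sigma^2 l^\tau (Z^\tau Z)^- l$, which
attains the Cramér-Rao lower bound under assumption A1 (Proposition
3.2).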




The vector $X - Z\hat\beta$ is called the residual vector and $\|X - Z\hat\beta\|^2$ is called
the sum of squared residuals and is denoted by SSR. The estimator $\hat\sigma^2$ is
then equal to $SSR/(n - r)$.

Since $X - Z\hat\beta = [I_n - Z(Z^\tau Z)^- Z^\tau]X$ and $l^\tau\hat\beta = l^\tau (Z^\tau Z)^- Z^\tau X$ are
linear in $X$, they are normally distributed under assumption A1. Also,
using the generalized inverse matrix in (3.34), we obtain that

$$[I_n - Z(Z^\tau Z)^- Z^\tau] Z (Z^\tau Z)^- = Z(Z^\tau Z)^- - Z(Z^\tau Z)^- Z^\tau Z (Z^\tau Z)^- = 0,$$

which implies that $\hat\sigma^2$ and $l^\tau\hat\beta$ are independent (Exercise 58 in §1.6) for any
estimable $l^\tau\beta$. Furthermore,

$$[Z(Z^\tau Z)^- Z^\tau]^2 = Z(Z^\tau Z)^- Z^\tau$$

(i.e., $Z(Z^\tau Z)^- Z^\tau$ is a projection matrix) and

$$SSR = X^\tau [I_n - Z(Z^\tau Z)^- Z^\tau] X.$$

The rank of $Z(Z^\tau Z)^- Z^\tau$ is $\mathrm{tr}(Z(Z^\tau Z)^- Z^\tau) = r$. Similarly, the rank of the
projection matrix $I_n - Z(Z^\tau Z)^- Z^\tau$ is $n - r$. From

$$X^\tau X = X^\tau [Z(Z^\tau Z)^- Z^\tau] X + X^\tau [I_n - Z(Z^\tau Z)^- Z^\tau] X$$

and Theorem 1.5 (Cochran's theorem), $SSR/\sigma^2$ has the chi-square distribution
$\chi^2_{n-r}(\delta)$ with

$$\delta = \sigma^{-2} \beta^\tau Z^\tau [I_n - Z(Z^\tau Z)^- Z^\tau] Z\beta = 0.$$

Thus, we have proved the following result.

Theorem 3.8. Consider model (3.25) with assumption A1. For any estimable
parameter $l^\tau\beta$, the UMVUE's $l^\tau\hat\beta$ and $\hat\sigma^2$ are independent; the
distribution of $l^\tau\hat\beta$ is $N(l^\tau\beta, \sigma^2 l^\tau (Z^\tau Z)^- l)$; and $(n - r)\hat\sigma^2/\sigma^2$ has the
chi-square distribution $\chi^2_{n-r}$. ∎


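Example 3.15. In Examples 3.12-3.14, UMVUE's of estimable $l^\tau\beta$ are the
LSE's $l^\tau\hat\beta$, under assumption A1. In Example 3.13,

$$SSR = \sum_{i=1}^m \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2;$$

in Example 3.14, if $c > 1$,

$$SSR = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^c (X_{ijk} - \bar{X}_{ij\cdot})^2. \; ∎$$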
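
For the one-way ANOVA model, the SSR formula above and the estimator $\hat\sigma^2 = SSR/(n - r)$ with $r = m$ (the rank of $Z$ when every group is nonempty) can be computed directly. The function name `one_way_ssr` and the data below are hypothetical.

```python
def one_way_ssr(groups):
    """Sum of squared deviations of each observation from its group mean."""
    ssr = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        ssr += sum((x - mean) ** 2 for x in g)
    return ssr


# Hypothetical data: m = 3 groups.
groups = [[1.0, 3.0], [2.0, 4.0, 6.0], [10.0]]
n = sum(len(g) for g in groups)
r = len(groups)  # rank of Z in the one-way ANOVA model

ssr = one_way_ssr(groups)
sigma2_hat = ssr / (n - r)
```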




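We now study properties of $l^\tau\hat\beta$ and $\hat\sigma^2$ under assumption A2, i.e., without
the normality assumption on $\varepsilon$. From Theorem 3.6 and the proof of
Theorem 3.7(ii), $l^\tau\hat\beta$ (with an $l \in \mathcal{R}(Z)$) and $\hat\sigma^2$ are still unbiased without
the normality assumption. In what sense are $l^\tau\hat\beta$ and $\hat\sigma^2$ optimal beyond
being unbiased? We have the following result for the LSE $l^\tau\hat\beta$. Some discussion
about $\hat\sigma^2$ can be found, for example, in Rao (1973, p. 228).

Theorem 3.9. Consider model (3.25) with assumption A2.

(i) A necessary and sufficient condition for the existence of a linear unbiased
estimator of $l^\tau\beta$ (i.e., an unbiased estimator that is linear in $X$) is $l \in \mathcal{R}(Z)$.
(ii) (Gauss-Markov theorem). If $l \in \mathcal{R}(Z)$, then the LSE $l^\tau\hat\beta$ is the best
linear unbiased estimator (BLUE) of $l^\tau\beta$ in the sense that it has the minimum
variance in the class of linear unbiased estimators of $l^\tau\beta$.

Proof. (i) The sufficiency has been established in Theorem 3.6. Suppose
now a linear function of $X$, $c^\tau X$ with $c \in R^n$, is unbiased for $l^\tau\beta$. Then

$$l^\tau\beta = E(c^\tau X) = c^\tau EX = c^\tau Z\beta.$$

Since this equality holds for all $\beta$, $l = Z^\tau c$, i.e., $l \in \mathcal{R}(Z)$.
(ii) Let $l \in \mathcal{R}(Z) = \mathcal{R}(Z^\tau Z)$. Then $l = (Z^\tau Z)\zeta$ for some $\zeta$ and $l^\tau\hat\beta =
\zeta^\tau (Z^\tau Z)\hat\beta = \zeta^\tau Z^\tau X$ by (3.27). Let $c^\tau X$ be any linear unbiased estimator
of $l^\tau\beta$. From the proof of (i), $Z^\tau c = l$. Then

$$\begin{aligned}
\mathrm{Cov}(\zeta^\tau Z^\tau X, c^\tau X - \zeta^\tau Z^\tau X) &= E(X^\tau Z\zeta c^\tau X) - E(X^\tau Z\zeta\zeta^\tau Z^\tau X) \\
&= \sigma^2 \mathrm{tr}(Z\zeta c^\tau) + \beta^\tau Z^\tau Z\zeta c^\tau Z\beta \\
&\quad - \sigma^2 \mathrm{tr}(Z\zeta\zeta^\tau Z^\tau) - \beta^\tau Z^\tau Z\zeta\zeta^\tau Z^\tau Z\beta \\
&= \sigma^2 \zeta^\tau l + (l^\tau\beta)^2 - \sigma^2 \zeta^\tau l - (l^\tau\beta)^2 \\
&= 0.
\end{aligned}$$

Hence

$$\begin{aligned}
\mathrm{Var}(c^\tau X) &= \mathrm{Var}(c^\tau X - \zeta^\tau Z^\tau X + \zeta^\tau Z^\tau X) \\
&= \mathrm{Var}(c^\tau X - \zeta^\tau Z^\tau X) + \mathrm{Var}(\zeta^\tau Z^\tau X) \\
&\quad + 2\mathrm{Cov}(\zeta^\tau Z^\tau X, c^\tau X - \zeta^\tau Z^\tau X) \\
&= \mathrm{Var}(c^\tau X - \zeta^\tau Z^\tau X) + \mathrm{Var}(l^\tau\hat\beta) \\
&\ge \mathrm{Var}(l^\tau\hat\beta). \; ∎
\end{aligned}$$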


3.3.3 Robustness of LSE’s 


Consider now model (3.25) under assumption A3. An interesting question
is under what conditions on $\mathrm{Var}(\varepsilon)$ is the LSE of $l^\tau\beta$ with $l \in \mathcal{R}(Z)$
still the BLUE. If $l^\tau\hat\beta$ is still the BLUE, then we say that $l^\tau\hat\beta$, considered
as a BLUE, is robust against violation of assumption A2. In general, a
statistical procedure having certain properties under an assumption is said
to be robust against violation of the assumption if and only if the statistical
procedure still has the same properties when the assumption is (slightly)
violated. For example, the LSE of $l^\tau\beta$ with $l \in \mathcal{R}(Z)$, as an unbiased estimator,
is robust against violation of assumption A1 or A2, since the LSE
is unbiased as long as $E(\varepsilon) = 0$, which can be always assumed without loss
of generality. On the other hand, the LSE as a UMVUE may not be robust
against violation of assumption A1 (see §3.5).


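Theorem 3.10. Consider model (3.25) with assumption A3. The following
are equivalent.
(a) $l^\tau\hat\beta$ is the BLUE of $l^\tau\beta$ for any $l \in \mathcal{R}(Z)$.
(b) $E(l^\tau\hat\beta\,\eta^\tau X) = 0$ for any $l \in \mathcal{R}(Z)$ and any $\eta$ such that $E(\eta^\tau X) = 0$.
(c) $Z^\tau \mathrm{Var}(\varepsilon) U = 0$, where $U$ is a matrix such that $Z^\tau U = 0$ and $\mathcal{R}(U^\tau) +
\mathcal{R}(Z^\tau) = R^n$.
(d) $\mathrm{Var}(\varepsilon) = Z\Lambda_1 Z^\tau + U\Lambda_2 U^\tau$ for some $\Lambda_1$ and $\Lambda_2$.
(e) The matrix $Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon)$ is symmetric.
Proof. We first show that (a) and (b) are equivalent, which is an analogue
of Theorem 3.2(i). Suppose that (b) holds. Let $l \in \mathcal{R}(Z)$. If $c^\tau X$ is
unbiased for $l^\tau\beta$, then $E(\eta^\tau X) = 0$ with $\eta = c - Z(Z^\tau Z)^- l$. Hence

$$\begin{aligned}
\mathrm{Var}(c^\tau X) &= \mathrm{Var}(c^\tau X - l^\tau\hat\beta + l^\tau\hat\beta) \\
&= \mathrm{Var}(c^\tau X - l^\tau (Z^\tau Z)^- Z^\tau X + l^\tau\hat\beta) \\
&= \mathrm{Var}(\eta^\tau X + l^\tau\hat\beta) \\
&= \mathrm{Var}(\eta^\tau X) + \mathrm{Var}(l^\tau\hat\beta) + 2\mathrm{Cov}(\eta^\tau X, l^\tau\hat\beta) \\
&= \mathrm{Var}(\eta^\tau X) + \mathrm{Var}(l^\tau\hat\beta) + 2E(l^\tau\hat\beta\,\eta^\tau X) \\
&= \mathrm{Var}(\eta^\tau X) + \mathrm{Var}(l^\tau\hat\beta) \\
&\ge \mathrm{Var}(l^\tau\hat\beta).
\end{aligned}$$

Suppose now that there are $l \in \mathcal{R}(Z)$ and $\eta$ such that $E(\eta^\tau X) = 0$ but
$\delta = E(l^\tau\hat\beta\,\eta^\tau X) \ne 0$. Let $c_t = t\eta + Z(Z^\tau Z)^- l$. From the previous proof,

$$\mathrm{Var}(c_t^\tau X) = t^2 \mathrm{Var}(\eta^\tau X) + \mathrm{Var}(l^\tau\hat\beta) + 2\delta t.$$

As long as $\delta \ne 0$, there exists a $t$ such that $\mathrm{Var}(c_t^\tau X) < \mathrm{Var}(l^\tau\hat\beta)$. This
shows that $l^\tau\hat\beta$ cannot be a BLUE and, therefore, (a) implies (b).

We next show that (b) implies (c). Suppose that (b) holds. Since
$l \in \mathcal{R}(Z)$, $l = Z^\tau\gamma$ for some $\gamma$. Let $\eta \in \mathcal{R}(U^\tau)$. Then $E(\eta^\tau X) = \eta^\tau Z\beta = 0$
and, hence,

$$0 = E(l^\tau\hat\beta\,\eta^\tau X) = E[\gamma^\tau Z(Z^\tau Z)^- Z^\tau X X^\tau \eta] = \gamma^\tau Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon)\eta.$$

Since this equality holds for all $l \in \mathcal{R}(Z)$, it holds for all $\gamma$. Thus,

$$Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) U = 0,$$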




which implies

$$Z^\tau Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) U = Z^\tau \mathrm{Var}(\varepsilon) U = 0,$$

since $Z^\tau Z(Z^\tau Z)^- Z^\tau = Z^\tau$. Thus, (c) holds.

To show that (c) implies (d), we need to use the following facts from
the theory of linear algebra: there exists a nonsingular matrix $C$ such
that $\mathrm{Var}(\varepsilon) = CC^\tau$ and $C = ZC_1 + UC_2$ for some matrices $C_j$ (since
$\mathcal{R}(U^\tau) + \mathcal{R}(Z^\tau) = R^n$). Let $\Lambda_1 = C_1 C_1^\tau$, $\Lambda_2 = C_2 C_2^\tau$, and $\Lambda_3 = C_1 C_2^\tau$.
Then

$$\mathrm{Var}(\varepsilon) = Z\Lambda_1 Z^\tau + U\Lambda_2 U^\tau + Z\Lambda_3 U^\tau + U\Lambda_3^\tau Z^\tau \tag{3.36}$$

and $Z^\tau \mathrm{Var}(\varepsilon) U = Z^\tau Z\Lambda_3 U^\tau U$, which is 0 if (c) holds. Hence, (c) implies

$$0 = Z(Z^\tau Z)^- Z^\tau Z\Lambda_3 U^\tau U (U^\tau U)^- U^\tau = Z\Lambda_3 U^\tau,$$

which with (3.36) implies (d).

If (d) holds, then $Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) = Z\Lambda_1 Z^\tau$, which is symmetric.
Hence (d) implies (e). To complete the proof, we need to show that (e)
implies (b), which is left as an exercise. ∎

As a corollary of this theorem, the following result shows when the
UMVUE's in model (3.25) with assumption A1 are robust against the violation
of $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$.

Corollary 3.3. Consider model (3.25) with a full rank $Z$, $\varepsilon = N_n(0, \Sigma)$,
and an unknown positive definite matrix $\Sigma$. Then $l^\tau\hat\beta$ is a UMVUE of $l^\tau\beta$
for any $l \in R^p$ if and only if one of (b)-(e) in Theorem 3.10 holds. ∎


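Example 3.16. Consider model (3.25) with $\beta$ replaced by a random vector
$\boldsymbol{\beta}$ that is independent of $\varepsilon$. Such a model is called a linear model with
random coefficients. Suppose that $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$ and $E(\boldsymbol{\beta}) = \beta$. Then

$$X = Z\beta + Z(\boldsymbol{\beta} - \beta) + \varepsilon = Z\beta + e, \tag{3.37}$$

where $e = Z(\boldsymbol{\beta} - \beta) + \varepsilon$ satisfies $E(e) = 0$ and

$$\mathrm{Var}(e) = Z\mathrm{Var}(\boldsymbol{\beta})Z^\tau + \sigma^2 I_n.$$

Since

$$Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(e) = Z\mathrm{Var}(\boldsymbol{\beta})Z^\tau + \sigma^2 Z(Z^\tau Z)^- Z^\tau$$

is symmetric, by Theorem 3.10, the LSE $l^\tau\hat\beta$ under model (3.37) is the
BLUE for any $l^\tau\beta$, $l \in \mathcal{R}(Z)$. If $Z$ is of full rank and $\varepsilon$ is normal, then, by
Corollary 3.3, $l^\tau\hat\beta$ is the UMVUE of $l^\tau\beta$ for any $l \in R^p$. ∎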




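Example 3.17 (Random effects models). Suppose that

$$X_{ij} = \mu + A_i + e_{ij}, \quad j = 1, \dots, n_i, \; i = 1, \dots, m, \tag{3.38}$$

where $\mu \in R$ is an unknown parameter, $A_i$'s are i.i.d. random variables
having mean 0 and variance $\sigma_a^2$, $e_{ij}$'s are i.i.d. random errors with mean 0
and variance $\sigma^2$, and $A_i$'s and $e_{ij}$'s are independent. Model (3.38) is called
a one-way random effects model and $A_i$'s are unobserved random effects.
Let $\varepsilon_{ij} = A_i + e_{ij}$. Then (3.38) is a special case of the general model (3.25)
with

$$\mathrm{Var}(\varepsilon) = \sigma_a^2 \Sigma + \sigma^2 I_n,$$

where $\Sigma$ is a block diagonal matrix whose $i$th block is $J_{n_i} J_{n_i}^\tau$ and $J_k$ is the
$k$-vector of ones. Under this model, $Z = J_n$, $n = \sum_{i=1}^m n_i$, and $Z(Z^\tau Z)^- Z^\tau =
n^{-1} J_n J_n^\tau$. Note that

$$J_n J_n^\tau \Sigma = \begin{pmatrix}
n_1 J_{n_1} J_{n_1}^\tau & n_2 J_{n_1} J_{n_2}^\tau & \cdots & n_m J_{n_1} J_{n_m}^\tau \\
n_1 J_{n_2} J_{n_1}^\tau & n_2 J_{n_2} J_{n_2}^\tau & \cdots & n_m J_{n_2} J_{n_m}^\tau \\
\vdots & \vdots & & \vdots \\
n_1 J_{n_m} J_{n_1}^\tau & n_2 J_{n_m} J_{n_2}^\tau & \cdots & n_m J_{n_m} J_{n_m}^\tau
\end{pmatrix},$$

which is symmetric if and only if $n_1 = n_2 = \cdots = n_m$. Since $J_n J_n^\tau \mathrm{Var}(\varepsilon)$
is symmetric if and only if $J_n J_n^\tau \Sigma$ is symmetric, a necessary and sufficient
condition for the LSE of $\mu$ to be the BLUE is that all $n_i$'s are the same.
This condition is also necessary and sufficient for the LSE of $\mu$ to be the
UMVUE when $\varepsilon_{ij}$'s are normal. ∎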
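
The symmetry condition in this example can be checked by forming $J_n J_n^\tau \Sigma$ explicitly for small group sizes. The sketch below is illustrative only; the helper names `jj_times_sigma` and `is_symmetric` and the group sizes are assumptions, not part of the text.

```python
def jj_times_sigma(sizes):
    """Return J_n J_n^T Sigma, where Sigma is block diagonal with blocks J J^T."""
    n = sum(sizes)
    # block[a] is the index of the group containing coordinate a.
    block = [i for i, s in enumerate(sizes) for _ in range(s)]
    sigma = [[1 if block[a] == block[b] else 0 for b in range(n)] for a in range(n)]
    # J_n J_n^T is the all-ones matrix, so entry (a, b) is a column sum of Sigma.
    return [[sum(sigma[c][b] for c in range(n)) for b in range(n)] for a in range(n)]


def is_symmetric(mat):
    n = len(mat)
    return all(mat[a][b] == mat[b][a] for a in range(n) for b in range(n))


balanced = is_symmetric(jj_times_sigma([2, 2, 2]))
unbalanced = is_symmetric(jj_times_sigma([1, 2, 3]))
```

With equal group sizes every entry of the product equals the common size, so the matrix is symmetric; with unequal sizes, entries in columns belonging to different groups differ.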


In some cases, we are interested in some (not all) linear functions of $\beta$.
For example, consider $l^\tau\beta$ with $l \in \mathcal{R}(H)$, where $H$ is an $n \times p$ matrix such
that $\mathcal{R}(H) \subset \mathcal{R}(Z)$. We have the following result.

Proposition 3.4. Consider model (3.25) with assumption A3. Suppose
that $H$ is a matrix such that $\mathcal{R}(H) \subset \mathcal{R}(Z)$. A necessary and sufficient
condition for the LSE $l^\tau\hat\beta$ to be the BLUE of $l^\tau\beta$ for any $l \in \mathcal{R}(H)$ is
$H(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) U = 0$, where $U$ is the same as that in (c) of Theorem
3.10. ∎

Example 3.18. Consider model (3.25) with assumption A3 and $Z =
(H_1 \; H_2)$, where $H_1^\tau H_2 = 0$. Suppose that under the reduced model

$$X = H_1\beta_1 + \varepsilon,$$

$l^\tau\hat\beta_1$ is the BLUE for any $l^\tau\beta_1$, $l \in \mathcal{R}(H_1)$, and that under the reduced
model

$$X = H_2\beta_2 + \varepsilon,$$




$l^\tau\hat\beta_2$ is not a BLUE for some $l^\tau\beta_2$, $l \in \mathcal{R}(H_2)$, where $\beta = (\beta_1, \beta_2)$ and $\hat\beta_j$'s
are LSE's under the reduced models. Let $H = (H_1 \; 0)$ be $n \times p$. Note that

$$H(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) U = H_1(H_1^\tau H_1)^- H_1^\tau \mathrm{Var}(\varepsilon) U,$$

which is 0 by Theorem 3.10 for the $U$ given in (c) of Theorem 3.10, and

$$Z(Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) U = H_2(H_2^\tau H_2)^- H_2^\tau \mathrm{Var}(\varepsilon) U,$$

which is not 0 by Theorem 3.10. This implies that some LSE $l^\tau\hat\beta$ is not a
BLUE of $l^\tau\beta$ but $l^\tau\hat\beta$ is the BLUE of $l^\tau\beta$ if $l \in \mathcal{R}(H)$. ∎

Finally, we consider model (3.25) with $\mathrm{Var}(\varepsilon)$ being a diagonal matrix
whose $i$th diagonal element is $\sigma_i^2$, i.e., $\varepsilon_i$'s are uncorrelated but have unequal
variances. A straightforward calculation shows that condition (e) in Theorem
3.10 holds if and only if, for all $i \ne j$, $\sigma_i^2 \ne \sigma_j^2$ only when $h_{ij} = 0$, where
$h_{ij}$ is the $(i, j)$th element of the projection matrix $Z(Z^\tau Z)^- Z^\tau$. Thus, an
LSE is not a BLUE in general, although it is still unbiased for estimable
$l^\tau\beta$.

Suppose that the unequal variances of $\varepsilon_i$'s are caused by some small
perturbations, i.e., $\varepsilon_i = e_i + u_i$, where $\mathrm{Var}(e_i) = \sigma^2$, $\mathrm{Var}(u_i) = \delta_i$, and $e_i$
and $u_i$ are independent so that $\sigma_i^2 = \sigma^2 + \delta_i$. From (3.35),

$$\mathrm{Var}(l^\tau\hat\beta) = l^\tau (Z^\tau Z)^- \sum_{i=1}^n \sigma_i^2 Z_i Z_i^\tau (Z^\tau Z)^- l.$$

If $\delta_i = 0$ for all $i$ (no perturbations), then assumption A2 holds and $l^\tau\hat\beta$
is the BLUE of any estimable $l^\tau\beta$ with $\mathrm{Var}(l^\tau\hat\beta) = \sigma^2 l^\tau (Z^\tau Z)^- l$. Suppose
that $0 < \delta_i \le \sigma^2\delta$. Then

$$\mathrm{Var}(l^\tau\hat\beta) \le (1 + \delta)\sigma^2 l^\tau (Z^\tau Z)^- l.$$

This indicates that the LSE is robust in the sense that its variance increases
slightly when there is a slight violation of the equal variance assumption
(small $\delta$).


3.3.4 Asymptotic properties of LSE’s 


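We consider first the consistency of the LSE $l^\tau\hat\beta$ with $l \in \mathcal{R}(Z)$ for every
$n$.

Theorem 3.11. Consider model (3.25) with assumption A3. Suppose that
$\sup_n \lambda_+[\mathrm{Var}(\varepsilon)] < \infty$, where $\lambda_+[A]$ is the largest eigenvalue of the matrix
$A$, and that $\lim_{n\to\infty} \lambda_+[(Z^\tau Z)^-] = 0$. Then $l^\tau\hat\beta$ is consistent in mse for
any $l \in \mathcal{R}(Z)$.
Proof. The result follows from the fact that $l^\tau\hat\beta$ is unbiased and

$$\begin{aligned}
\mathrm{Var}(l^\tau\hat\beta) &= l^\tau (Z^\tau Z)^- Z^\tau \mathrm{Var}(\varepsilon) Z (Z^\tau Z)^- l \\
&\le \lambda_+[\mathrm{Var}(\varepsilon)]\, l^\tau (Z^\tau Z)^- l. \; ∎
\end{aligned}$$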


Without the normality assumption on $\varepsilon$, the exact distribution of $l^\tau\hat\beta$
is very hard to obtain. The asymptotic distribution of $l^\tau\hat\beta$ is derived in the
following result.


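Theorem 3.12. Consider model (3.25) with assumption A3. Suppose that
$0 < \inf_n \lambda_-[\mathrm{Var}(\varepsilon)]$, where $\lambda_-[A]$ is the smallest eigenvalue of the matrix
$A$, and that

$$\lim_{n\to\infty} \max_{1\le i\le n} Z_i^\tau (Z^\tau Z)^- Z_i = 0. \tag{3.39}$$

Suppose further that $n = \sum_{j=1}^k m_j$ for some integers $k$, $m_j$, $j = 1,\ldots,k$,
with $m_j$'s bounded by a fixed integer $m$, $\varepsilon = (\xi_1,\ldots,\xi_k)$, $\xi_j \in \mathcal{R}^{m_j}$, and $\xi_j$'s
are independent.

(i) If $\sup_i E|\varepsilon_i|^{2+\delta} < \infty$, then for any $l \in \mathcal{R}(Z)$,

$$l^\tau(\hat\beta - \beta)\Big/\sqrt{\mathrm{Var}(l^\tau\hat\beta)} \to_d N(0,1). \tag{3.40}$$

(ii) Suppose that when $m_i = m_j$, $1 \le i < j \le k$, $\xi_i$ and $\xi_j$ have the same
distribution. Then result (3.40) holds for any $l \in \mathcal{R}(Z)$.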
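Proof. Let $l \in \mathcal{R}(Z)$. Then

$$l^\tau (Z^\tau Z)^- Z^\tau Z\beta - l^\tau\beta = 0$$

and

$$l^\tau(\hat\beta - \beta) = l^\tau (Z^\tau Z)^- Z^\tau \varepsilon = \sum_{j=1}^k c_{nj}^\tau \xi_j,$$

where $c_{nj}$ is the $m_j$-vector whose components are $l^\tau (Z^\tau Z)^- Z_i$, $i = k_{j-1}+1,\ldots,k_j$,
$k_0 = 0$, and $k_j = \sum_{t=1}^j m_t$, $j = 1,\ldots,k$. Note that

$$\sum_{j=1}^k \|c_{nj}\|^2 = l^\tau (Z^\tau Z)^- Z^\tau Z (Z^\tau Z)^- l = l^\tau (Z^\tau Z)^- l. \tag{3.41}$$

Also,

$$\max_{1\le j\le k} \|c_{nj}\|^2 \le m \max_{1\le i\le n} [l^\tau (Z^\tau Z)^- Z_i]^2 \le m\, l^\tau (Z^\tau Z)^- l \max_{1\le i\le n} Z_i^\tau (Z^\tau Z)^- Z_i,$$

which, together with (3.41) and condition (3.39), implies that

$$\lim_{n\to\infty} \left( \max_{1\le j\le k} \|c_{nj}\|^2 \Big/ \sum_{j=1}^k \|c_{nj}\|^2 \right) = 0.$$

The results then follow from Corollary 1.3. ∎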


Under the conditions of Theorem 3.12, $\mathrm{Var}(\varepsilon)$ is a diagonal block matrix
with $\mathrm{Var}(\xi_j)$ as the $j$th diagonal block, which includes the case of independent
$\varepsilon_i$'s as a special case.


The following lemma tells us how to check condition (3.39). 


Lemma 3.3. The following are sufficient conditions for (3.39).
(a) $\lambda_+[(Z^\tau Z)^-] \to 0$ and $Z_n^\tau (Z^\tau Z)^- Z_n \to 0$, as $n \to \infty$.
(b) There is an increasing sequence $\{a_n\}$ such that $a_n \to \infty$, $a_n/a_{n+1} \to 1$,
and $Z^\tau Z/a_n$ converges to a positive definite matrix. ∎

If $n^{-1}\sum_{i=1}^n t_i^2 \to c$ and $n^{-1}\sum_{i=1}^n t_i \to d$ in the simple linear regression
model (Example 3.12), where $c$ is positive and $c > d^2$, then condition (b) in
Lemma 3.3 is satisfied with $a_n = n$ and, therefore, Theorem 3.12 applies.
In the one-way ANOVA model (Example 3.13),

$$\max_{1\le i\le n} Z_i^\tau (Z^\tau Z)^- Z_i = \lambda_+[(Z^\tau Z)^-] = \max_{1\le j\le m} n_j^{-1}.$$

Hence conditions related to $Z$ in Theorem 3.12 are satisfied if and only
if $\min_j n_j \to \infty$. Some similar conclusions can be drawn in the two-way
ANOVA model (Example 3.14).
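Condition (3.39) says that the largest diagonal element of the projection matrix (the maximum leverage) vanishes as $n$ grows. The sketch below illustrates this for two designs. It assumes the standard closed form $Z_i^\tau(Z^\tau Z)^{-1}Z_i = 1/n + (t_i - \bar t)^2/\sum_j (t_j - \bar t)^2$ for simple linear regression with an intercept; the equally spaced design $t_i = i/n$ and the function names are illustrative assumptions.

```python
def max_leverage_simple_regression(ts):
    """max_i Z_i^T (Z^T Z)^{-1} Z_i for rows Z_i = (1, t_i)."""
    n = len(ts)
    t_bar = sum(ts) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    return max(1.0 / n + (t - t_bar) ** 2 / sxx for t in ts)

def max_leverage_one_way_anova(group_sizes):
    """max_i Z_i^T (Z^T Z)^- Z_i = max_j 1 / n_j in the one-way ANOVA model."""
    return max(1.0 / n_j for n_j in group_sizes)

def equally_spaced(n):
    """Assumed design t_i = i / n, i = 1, ..., n."""
    return [i / n for i in range(1, n + 1)]

leverages = {n: max_leverage_simple_regression(equally_spaced(n))
             for n in (10, 100, 1000)}
print(leverages)                                 # decreases toward 0 as n grows
print(max_leverage_one_way_anova([5, 50, 500]))  # driven by the smallest group
```

For the ANOVA design the maximum leverage is governed by the smallest group, which is why the condition holds only when $\min_j n_j \to \infty$.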


3.4 Unbiased Estimators in Survey Problems 


In this section, we consider unbiased estimation for another type of
non-i.i.d. data often encountered in applications: survey data from finite
populations. A description of the problem is given in Example 2.3 of §2.1.1.
Examples and a fuller account of theoretical aspects of survey sampling
can be found, for example, in Cochran (1977) and Särndal, Swensson, and
Wretman (1992).


3.4.1 UMVUE's of population totals


We use the same notation as in Example 2.3. Let $X = (X_1,\ldots,X_n)$ be a
sample from a finite population $\mathcal{P} = \{y_1,\ldots,y_N\}$ with

$$P(X_1 = y_{i_1},\ldots,X_n = y_{i_n}) = p(s)/n!,$$

where $s = \{i_1,\ldots,i_n\}$ is a subset of distinct elements of $\{1,\ldots,N\}$ and $p$ is
a selection probability measure. We consider univariate $y_i$, although most
of our conclusions are valid for the case of multivariate $y_i$. In many survey
problems the parameter to be estimated is $Y = \sum_{i=1}^N y_i$, the population
total.

In Example 2.27, it is shown that $\hat Y = N\bar X = \frac{N}{n}\sum_{i\in s} y_i$ is unbiased for
$Y$ if $p(s)$ is constant (simple random sampling); a formula of $\mathrm{Var}(\hat Y)$ is also
given. We now show that $\hat Y$ is in fact the UMVUE of $Y$ under simple random
sampling. Let $\mathcal{Y}$ be the range of $y_i$, $\theta = (y_1,\ldots,y_N)$ and $\Theta = \prod_{i=1}^N \mathcal{Y}$.
Under simple random sampling, the population under consideration is a
parametric family indexed by $\theta \in \Theta$.


Theorem 3.13 (Watson-Royall theorem). (i) If $p(s) > 0$ for all $s$, then
the vector of order statistics $X_{(1)} \le \cdots \le X_{(n)}$ is complete for $\theta \in \Theta$.

(ii) Under simple random sampling, the vector of order statistics is sufficient
for $\theta \in \Theta$.

(iii) Under simple random sampling, for any estimable function of $\theta$, its
unique UMVUE is the unbiased estimator $g(X_1,\ldots,X_n)$, where $g$ is symmetric
in its $n$ arguments.

Proof. (i) Let $h(X)$ be a function of the order statistics. Then $h$ is symmetric
in its $n$ arguments. We need to show that if

$$E[h(X)] = \sum_{s=\{i_1,\ldots,i_n\}\subset\{1,\ldots,N\}} p(s)\, h(y_{i_1},\ldots,y_{i_n}) = 0 \tag{3.42}$$

for all $\theta \in \Theta$, then $h(y_{i_1},\ldots,y_{i_n}) = 0$ for all $y_{i_1},\ldots,y_{i_n}$. First, suppose that
all $N$ elements of $\theta$ are equal to $a \in \mathcal{Y}$. Then (3.42) implies $h(a,\ldots,a) = 0$.
Next, suppose that $N-1$ elements in $\theta$ are equal to $a$ and one is $b > a$.
Then (3.42) reduces to

$$q_1 h(a,\ldots,a) + q_2 h(a,\ldots,a,b),$$

where $q_1$ and $q_2$ are some known numbers in $(0,1)$. Since $h(a,\ldots,a) = 0$
and $q_2 \neq 0$, $h(a,\ldots,a,b) = 0$. Using the same argument, we can show
that $h(a,\ldots,a,b,\ldots,b) = 0$ for any $k$ $a$'s and $n-k$ $b$'s. Suppose next that
elements of $\theta$ are equal to $a$, $b$, or $c$, $a < b < c$. Then we can show that
$h(a,\ldots,a,b,\ldots,b,c,\ldots,c) = 0$ for any $k$ $a$'s, $l$ $b$'s, and $n-k-l$ $c$'s. Continuing
inductively, we see that $h(y_1,\ldots,y_n) = 0$ for all possible $y_1,\ldots,y_n$. This
completes the proof of (i).

(ii) The result follows from the factorization theorem (Theorem 2.2), the
fact that $p(s)$ is constant under simple random sampling, and

$$P(X_1 = y_{i_1},\ldots,X_n = y_{i_n}) = P(X_{(1)} = y_{(i_1)},\ldots,X_{(n)} = y_{(i_n)})/n!,$$

where $y_{(i_1)} \le \cdots \le y_{(i_n)}$ are the ordered values of $y_{i_1},\ldots,y_{i_n}$.
(iii) The result follows directly from (i) and (ii). ∎




It is interesting to note the following two issues. (1) Although we have 
a parametric problem under simple random sampling, the sufficient and 
complete statistic is the same as that in a nonparametric problem (Example 
2.17). (2) For the completeness of the order statistics, we do not need the 
assumption of simple random sampling. 


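Example 3.19. From Example 2.27, $\hat Y = N\bar X$ is unbiased for $Y$. Since $\hat Y$
is symmetric in its arguments, it is the UMVUE of $Y$. We now derive the
UMVUE for $\mathrm{Var}(\hat Y)$. From Example 2.27,

$$\mathrm{Var}(\hat Y) = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)\sigma^2, \tag{3.43}$$

where

$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N \left(y_i - \frac{Y}{N}\right)^2.$$

It can be shown (exercise) that $E(S^2) = \sigma^2$, where $S^2$ is the usual sample
variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$

Since $S^2$ is symmetric in its arguments, $\frac{N^2}{n}\left(1 - \frac{n}{N}\right)S^2$ is the UMVUE of
$\mathrm{Var}(\hat Y)$. ∎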
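The three facts used in this example, namely $E(\hat Y) = Y$, formula (3.43), and $E(S^2) = \sigma^2$, can be verified exactly on a small population by enumerating every simple random sample. The sketch below does this with rational arithmetic; the population values and the function name `srs_summary` are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

def srs_summary(y, n):
    """Enumerate all size-n subsets (each equally likely under simple random
    sampling) and return E(Y_hat), Var(Y_hat), and E of the variance estimator."""
    N = len(y)
    samples = list(combinations(y, n))
    totals = [Fraction(N, n) * sum(s) for s in samples]  # Y_hat = N * sample mean
    mean_total = sum(totals) / len(totals)
    var_total = sum((t - mean_total) ** 2 for t in totals) / len(totals)

    def sample_variance(s):
        x_bar = Fraction(sum(s), n)
        return sum((x - x_bar) ** 2 for x in s) / (n - 1)

    factor = Fraction(N * N, n) * (1 - Fraction(n, N))
    var_estimates = [factor * sample_variance(s) for s in samples]
    return mean_total, var_total, sum(var_estimates) / len(var_estimates)

y = [3, 7, 8, 12, 20, 25]  # illustrative population
N, n = len(y), 3
Y = sum(y)
sigma2 = sum((Fraction(v) - Fraction(Y, N)) ** 2 for v in y) / (N - 1)
formula_var = Fraction(N * N, n) * (1 - Fraction(n, N)) * sigma2  # (3.43)

mean_total, var_total, mean_var_estimate = srs_summary(y, n)
print(mean_total == Y, var_total == formula_var, mean_var_estimate == formula_var)
```

Because every subset is enumerated, the comparisons are exact identities rather than Monte Carlo approximations.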


Simple random sampling is simple and easy to use, but it is inefficient 
unless the population is fairly homogeneous w.r.t. the y;’s. A sampling 
plan often used in practice is the stratified sampling plan, which can be 
described as follows. The population $\mathcal{P}$ is divided into nonoverlapping
subpopulations $\mathcal{P}_1,\ldots,\mathcal{P}_H$ called strata; a sample is drawn from each
stratum $\mathcal{P}_h$, independently across the strata. There are many reasons for
stratification: (1) it may produce a gain in precision in parameter estimation
when a heterogeneous population is divided into strata, each of which is 
internally homogeneous; (2) sampling problems may differ markedly in dif- 
ferent parts of the population; and (3) administrative considerations may 
also lead to stratification. More discussions can be found, for example, in 
Cochran (1977). 


In stratified sampling, if a simple random sample (without replacement),
$X_h = (X_{h1},\ldots,X_{hn_h})$, is drawn from each stratum, where $n_h$ is the sample
size in stratum $h$, then the joint distribution of $X = (X_1,\ldots,X_H)$ is in a
parametric family indexed by $\theta = (\theta_1,\ldots,\theta_H)$, where $\theta_h = (y_i, i \in \mathcal{P}_h)$,
$h = 1,\ldots,H$. Let $\mathcal{Y}_h$ be the range of $y_i$'s in stratum $h$ and $\Theta_h = \prod_{i=1}^{N_h} \mathcal{Y}_h$, where
$N_h$ is the size of $\mathcal{P}_h$. We assume that the parameter space is $\Theta = \prod_{h=1}^H \Theta_h$.
The following result is similar to Theorem 3.13.




Theorem 3.14. Let $X$ be a sample obtained using the stratified simple
random sampling plan described previously.

(i) For each $h$, let $Z_h$ be the vector of the ordered values of the sample in
stratum $h$. Then $(Z_1,\ldots,Z_H)$ is sufficient and complete for $\theta \in \Theta$.

(ii) For any estimable function of $\theta$, its unique UMVUE is the unbiased
estimator $g(X)$ that is symmetric in its first $n_1$ arguments, symmetric in
its second $n_2$ arguments, ..., and symmetric in its last $n_H$ arguments. ∎


Example 3.20. Consider the estimation of the population total $Y$ based on
a sample $X = (X_{hi}, i = 1,\ldots,n_h, h = 1,\ldots,H)$ obtained by stratified simple
random sampling. Let $Y_h$ be the population total of the $h$th stratum and
let $\hat Y_h = N_h\bar X_{h\cdot}$, where $\bar X_{h\cdot}$ is the sample mean of the sample from stratum
$h$, $h = 1,\ldots,H$. From Example 2.27, each $\hat Y_h$ is an unbiased estimator of
$Y_h$. Let

$$\hat Y_{st} = \sum_{h=1}^H \hat Y_h = \sum_{h=1}^H \sum_{i=1}^{n_h} \frac{N_h}{n_h} X_{hi}.$$

Then, by Theorem 3.14, $\hat Y_{st}$ is the UMVUE of $Y$. Since $\hat Y_1,\ldots,\hat Y_H$ are
independent, it follows from (3.43) that

$$\mathrm{Var}(\hat Y_{st}) = \sum_{h=1}^H \frac{N_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right)\sigma_h^2, \tag{3.44}$$

where $\sigma_h^2 = (N_h - 1)^{-1}\sum_{i\in\mathcal{P}_h} (y_i - Y_h/N_h)^2$. An argument similar to that
in Example 3.19 shows that the UMVUE of $\mathrm{Var}(\hat Y_{st})$ is

$$S_{st}^2 = \sum_{h=1}^H \frac{N_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right)S_h^2, \tag{3.45}$$

where $S_h^2$ is the usual sample variance based on $X_{h1},\ldots,X_{hn_h}$.

It is interesting to compare the mse of the UMVUE $\hat Y_{st}$ with the mse of
the UMVUE $\hat Y$ under simple random sampling (Example 3.19). Let $\sigma^2$ be
given in (3.43). Then

$$(N-1)\sigma^2 = \sum_{h=1}^H (N_h - 1)\sigma_h^2 + \sum_{h=1}^H N_h(\mu_h - \mu)^2,$$

where $\mu_h = Y_h/N_h$ is the population mean of the $h$th stratum and $\mu = Y/N$
is the overall population mean. By (3.43), (3.44), and (3.45), $\mathrm{Var}(\hat Y) \ge
\mathrm{Var}(\hat Y_{st})$ if and only if

$$\sum_{h=1}^H \frac{N^2}{n(N-1)}\left(1 - \frac{n}{N}\right)N_h(\mu_h - \mu)^2 \ge \sum_{h=1}^H \left[\frac{N_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right) - \frac{N^2(N_h-1)}{n(N-1)}\left(1 - \frac{n}{N}\right)\right]\sigma_h^2.$$

This means that stratified simple random sampling is better than simple
random sampling if the deviations $\mu_h - \mu$ are sufficiently large. If $n_h/N_h = n/N$
(proportional allocation), then this condition simplifies to

$$\sum_{h=1}^H N_h(\mu_h - \mu)^2 \ge \sum_{h=1}^H \left(1 - \frac{N_h}{N}\right)\sigma_h^2, \tag{3.46}$$

which is usually true when $\mu_h$'s are different and some $N_h$'s are large.

Note that the variances $\mathrm{Var}(\hat Y)$ and $\mathrm{Var}(\hat Y_{st})$ are w.r.t. different sampling
plans under which $\hat Y$ and $\hat Y_{st}$ are obtained. ∎
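The comparison between (3.43) and (3.44) can be made concrete with a toy population. The sketch below assumes three strata of equal size with proportional allocation ($n_h/N_h = n/N = 1/2$); the population values and all names are illustrative. It checks the variance decomposition and the equivalence between $\mathrm{Var}(\hat Y) \ge \mathrm{Var}(\hat Y_{st})$ and condition (3.46).

```python
from fractions import Fraction

def finite_pop_variance(values):
    """sigma^2 = (N - 1)^{-1} * sum (y_i - Y/N)^2 for a finite population."""
    N = len(values)
    mu = Fraction(sum(values), N)
    return sum((v - mu) ** 2 for v in values) / (N - 1)

# Illustrative strata (internally homogeneous, well separated).
strata = [[1, 2, 3, 4], [10, 11, 13, 14], [20, 22, 23, 27]]
sample_sizes = [2, 2, 2]  # proportional allocation: n_h / N_h = 1/2
population = [v for stratum in strata for v in stratum]
N, n = len(population), sum(sample_sizes)

# (3.43): variance of N * X_bar under simple random sampling.
var_srs = Fraction(N * N, n) * (1 - Fraction(n, N)) * finite_pop_variance(population)
# (3.44): variance of the stratified estimator.
var_st = sum(Fraction(len(s) ** 2, n_h) * (1 - Fraction(n_h, len(s)))
             * finite_pop_variance(s) for s, n_h in zip(strata, sample_sizes))

mu = Fraction(sum(population), N)
between = sum(len(s) * (Fraction(sum(s), len(s)) - mu) ** 2 for s in strata)
within = sum((len(s) - 1) * finite_pop_variance(s) for s in strata)
rhs_346 = sum((1 - Fraction(len(s), N)) * finite_pop_variance(s) for s in strata)
print(float(var_srs), float(var_st), between >= rhs_346)
```

With strata this well separated, the between-stratum term dominates and stratification reduces the variance substantially.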


3.4.2 Horvitz-Thompson estimators 


If some elements of the finite population P are groups (called clusters) of 
subunits, then sampling from P is cluster sampling. Cluster sampling is 
used often because of administrative convenience or economic considera- 
tions. Although sometimes the first intention may be to use the subunits 
as sampling units, it is found that no reliable list of the subunits in the 
population is available. For example, in many countries there are no com- 
plete lists of the people or houses in a region. From the maps of the region, 
however, it can be divided into units such as cities or blocks in the cities. 


In cluster sampling, one may greatly increase the precision of estima- 
tion by using sampling with probability proportional to cluster size. Thus, 
unequal probability sampling is often used. 


Suppose that a sample of clusters is obtained. If subunits within a 
selected cluster give similar results, then it may be uneconomical to measure 
them all. A sample of the subunits in any chosen cluster may be selected. 
This is called two-stage sampling. One can continue this process to have a 
multistage sampling (e.g., cities → blocks → houses → people). Of course,
at each stage one may use stratified sampling and/or unequal probability 
sampling. 

When the sampling plan is complex, so is the structure of the observa- 
tions. We now introduce a general method of deriving unbiased estimators 
of population totals, which are called Horvitz-Thompson estimators. 


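Theorem 3.15. Let $X = \{y_i, i \in s\}$ denote a sample from $\mathcal{P} = \{y_1,\ldots,y_N\}$
that is selected, without replacement, by some method. Define

$$\pi_i = \text{probability that } i \in s, \quad i = 1,\ldots,N.$$

(i) (Horvitz-Thompson). If $\pi_i > 0$ for $i = 1,\ldots,N$ and $\pi_i$ is known when
$i \in s$, then $\hat Y_{ht} = \sum_{i\in s} y_i/\pi_i$ is an unbiased estimator of the population
total $Y$.
(ii) Define

$$\pi_{ij} = \text{probability that } i \in s \text{ and } j \in s, \quad i = 1,\ldots,N,\ j = 1,\ldots,N.$$

Then

$$\mathrm{Var}(\hat Y_{ht}) = \sum_{i=1}^N \frac{1-\pi_i}{\pi_i}y_i^2 + 2\sum_{i=1}^N\sum_{j=i+1}^N \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}y_iy_j \tag{3.47}$$

$$= \sum_{i=1}^N\sum_{j=i+1}^N (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2. \tag{3.48}$$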


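Proof. (i) Let $a_i = 1$ if $i \in s$ and $a_i = 0$ if $i \notin s$, $i = 1,\ldots,N$. Then
$E(a_i) = \pi_i$ and

$$E(\hat Y_{ht}) = E\left(\sum_{i=1}^N \frac{a_iy_i}{\pi_i}\right) = \sum_{i=1}^N y_i = Y.$$

(ii) Since $a_i^2 = a_i$,

$$\mathrm{Var}(a_i) = E(a_i) - [E(a_i)]^2 = \pi_i(1 - \pi_i).$$

For $i \neq j$,

$$\mathrm{Cov}(a_i, a_j) = E(a_ia_j) - E(a_i)E(a_j) = \pi_{ij} - \pi_i\pi_j.$$

Then

$$\mathrm{Var}(\hat Y_{ht}) = \mathrm{Var}\left(\sum_{i=1}^N \frac{a_iy_i}{\pi_i}\right)
= \sum_{i=1}^N \frac{y_i^2}{\pi_i^2}\mathrm{Var}(a_i) + 2\sum_{i=1}^N\sum_{j=i+1}^N \frac{y_iy_j}{\pi_i\pi_j}\mathrm{Cov}(a_i, a_j)
= \sum_{i=1}^N \frac{1-\pi_i}{\pi_i}y_i^2 + 2\sum_{i=1}^N\sum_{j=i+1}^N \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}y_iy_j.$$

Hence (3.47) follows. To show (3.48), note that

$$\sum_{i=1}^N \pi_i = n \quad\text{and}\quad \sum_{j=1,\ldots,N,\ j\neq i} \pi_{ij} = (n-1)\pi_i,$$

which implies

$$\sum_{j=1,\ldots,N,\ j\neq i} (\pi_{ij} - \pi_i\pi_j) = (n-1)\pi_i - \pi_i(n - \pi_i) = -\pi_i(1 - \pi_i).$$

Hence

$$\sum_{i=1}^N \frac{1-\pi_i}{\pi_i}y_i^2 = \sum_{i=1}^N \sum_{j\neq i} (\pi_i\pi_j - \pi_{ij})\frac{y_i^2}{\pi_i^2}
= \sum_{i=1}^N\sum_{j=i+1}^N (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i^2}{\pi_i^2} + \frac{y_j^2}{\pi_j^2}\right)$$

and, by (3.47),

$$\mathrm{Var}(\hat Y_{ht}) = \sum_{i=1}^N\sum_{j=i+1}^N (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i^2}{\pi_i^2} + \frac{y_j^2}{\pi_j^2} - \frac{2y_iy_j}{\pi_i\pi_j}\right)
= \sum_{i=1}^N\sum_{j=i+1}^N (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2.$$ ∎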


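Using the same idea, we can obtain unbiased estimators of $\mathrm{Var}(\hat Y_{ht})$.
Suppose that $\pi_{ij} > 0$ for all $i$ and $j$ and $\pi_{ij}$ is known when $i \in s$ and $j \in s$.
By (3.47), an unbiased estimator of $\mathrm{Var}(\hat Y_{ht})$ is

$$v_1 = \sum_{i\in s} \frac{1-\pi_i}{\pi_i^2}y_i^2 + 2\sum_{i\in s}\sum_{j\in s,\, j>i} \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j\pi_{ij}}y_iy_j. \tag{3.49}$$

By (3.48), an unbiased estimator of $\mathrm{Var}(\hat Y_{ht})$ is

$$v_2 = \sum_{i\in s}\sum_{j\in s,\, j>i} \frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2. \tag{3.50}$$

Variance estimators $v_1$ and $v_2$ may not be the same in general, but they
are the same in some special cases (Exercise 92). A more serious problem
is that they may take negative values. Some discussions about deriving
better estimators of $\mathrm{Var}(\hat Y_{ht})$ are provided in Cochran (1977, Chapter 9A).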
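The identities (3.47) and (3.48) and the unbiasedness of $v_1$ and $v_2$ can be checked exactly on a small design. The sketch below assumes a hypothetical fixed-size design with $N = 4$, $n = 2$, and unequal selection probabilities $p(s)$ over the six possible samples; the population values, the probabilities, and the function names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

y = [Fraction(v) for v in (2, 5, 9, 4)]  # illustrative population
N = len(y)
# Hypothetical design: p(s) for each 2-subset of {0, 1, 2, 3}, summing to 1.
design = dict(zip(combinations(range(N), 2),
                  [Fraction(k, 21) for k in (1, 2, 3, 4, 5, 6)]))

# First- and second-order inclusion probabilities implied by the design.
pi = [sum(p for s, p in design.items() if i in s) for i in range(N)]
pij = {(i, j): sum(p for s, p in design.items() if i in s and j in s)
       for i, j in combinations(range(N), 2)}

def ht(s):
    """Horvitz-Thompson estimator of the total from the sample index set s."""
    return sum(y[i] / pi[i] for i in s)

def v1(s):
    """Variance estimator (3.49)."""
    first = sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in s)
    second = sum((pij[i, j] - pi[i] * pi[j]) / (pi[i] * pi[j] * pij[i, j])
                 * y[i] * y[j] for i, j in combinations(s, 2))
    return first + 2 * second

def v2(s):
    """Variance estimator (3.50)."""
    return sum((pi[i] * pi[j] - pij[i, j]) / pij[i, j]
               * (y[i] / pi[i] - y[j] / pi[j]) ** 2 for i, j in combinations(s, 2))

def expect(f):
    """Exact expectation of f(s) under the design."""
    return sum(p * f(s) for s, p in design.items())

mean_ht = expect(ht)
var_ht = expect(lambda s: (ht(s) - mean_ht) ** 2)
var_347 = (sum((1 - pi[i]) / pi[i] * y[i] ** 2 for i in range(N))
           + 2 * sum((pij[i, j] - pi[i] * pi[j]) / (pi[i] * pi[j]) * y[i] * y[j]
                     for i, j in combinations(range(N), 2)))
var_348 = sum((pi[i] * pi[j] - pij[i, j]) * (y[i] / pi[i] - y[j] / pi[j]) ** 2
              for i, j in combinations(range(N), 2))
print(mean_ht == sum(y), var_ht == var_347 == var_348)
print([float(v1(s)) for s in design])  # individual values may be negative
```

Note that (3.48) relies on the sample size being fixed, which this toy design satisfies by construction.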
Some special cases of Theorem 3.15 are considered as follows. 


Under simple random sampling, $\pi_i = n/N$. Thus, $\hat Y$ in Example 3.19 is
the Horvitz-Thompson estimator.

Under stratified simple random sampling, $\pi_i = n_h/N_h$ if unit $i$ is in stratum
$h$. Hence, the estimator $\hat Y_{st}$ in Example 3.20 is the Horvitz-Thompson
estimator.

Suppose now each $y_i \in \mathcal{P}$ is a cluster, i.e., $y_i = (y_{i1},\ldots,y_{iM_i})$, where
$M_i$ is the size of the $i$th cluster, $i = 1,\ldots,N$. The total number of units in
$\mathcal{P}$ is then $M = \sum_{i=1}^N M_i$. Consider a single-stage sampling plan, i.e., if $y_i$
is selected, then every $y_{ij}$ is observed. If simple random sampling is used,
then $\pi_i = k/N$, where $k$ is the first-stage sample size (the total sample size
is $n = \sum_{i\in s_1} M_i$), and the Horvitz-Thompson estimator is

$$\hat Y_s = \frac{N}{k}\sum_{i\in s_1}\sum_{j=1}^{M_i} y_{ij} = \frac{N}{k}\sum_{i\in s_1} Y_i,$$

where $s_1$ is the index set of first-stage sampled clusters and $Y_i$ is the total
of the $i$th cluster. In this case,

$$\mathrm{Var}(\hat Y_s) = \frac{N^2}{k(N-1)}\left(1 - \frac{k}{N}\right)\sum_{i=1}^N \left(Y_i - \frac{Y}{N}\right)^2.$$

If the selection probability is proportional to the cluster size, then $\pi_i =
kM_i/M$ and the Horvitz-Thompson estimator is

$$\hat Y_{pps} = \frac{M}{k}\sum_{i\in s_1}\frac{1}{M_i}\sum_{j=1}^{M_i} y_{ij} = \frac{M}{k}\sum_{i\in s_1}\frac{Y_i}{M_i},$$

whose variance is given by (3.47) or (3.48). Usually $\mathrm{Var}(\hat Y_{pps})$ is smaller
than $\mathrm{Var}(\hat Y_s)$; see the discussions in Cochran (1977, Chapter 9A).


Consider next a two-stage sampling in which $k$ first-stage clusters are selected
and a simple random sample of size $m_i$ is selected from each sampled
cluster $y_i$, where sampling is independent across clusters. If the first-stage
sampling plan is simple random sampling, then $\pi_i = km_i/(NM_i)$ and the
Horvitz-Thompson estimator is

$$\hat Y_s = \frac{N}{k}\sum_{i\in s_1}\frac{M_i}{m_i}\sum_{j\in s_{2i}} y_{ij},$$

where $s_{2i}$ denotes the second-stage sample from cluster $i$. If the first-stage
selection probability is proportional to the cluster size, then $\pi_i = km_i/M$
and the Horvitz-Thompson estimator is

$$\hat Y_{pps} = \frac{M}{k}\sum_{i\in s_1}\frac{1}{m_i}\sum_{j\in s_{2i}} y_{ij}.$$


Finally, let us consider another popular sampling method called systematic
sampling. Suppose that $\mathcal{P} = \{y_1,\ldots,y_N\}$ and the population size
$N = nk$ for two integers $n$ and $k$. To select a sample of size $n$, we first draw
a $j$ randomly from $\{1,\ldots,k\}$. Our sample is then

$$\{y_j, y_{j+k}, y_{j+2k}, \ldots, y_{j+(n-1)k}\}.$$

Systematic sampling is used mainly because it is easier to draw a systematic
sample and often easier to execute without mistakes. It is also likely that
systematic sampling provides more efficient point estimators than simple
random sampling or even stratified sampling, since the sample units are
spread more evenly over the population. Under systematic sampling, $\pi_i =
k^{-1}$ for every $i$ and the Horvitz-Thompson estimator of the population total
is

$$\hat Y_{sy} = k\sum_{t=1}^n y_{j+(t-1)k}.$$

The unbiasedness of this estimator is a direct consequence of Theorem 3.15,
but it can be easily shown as follows. Since $j$ takes value $i \in \{1,\ldots,k\}$ with
probability $k^{-1}$,

$$E(\hat Y_{sy}) = k\cdot\frac{1}{k}\sum_{i=1}^k\sum_{t=1}^n y_{i+(t-1)k} = \sum_{i=1}^N y_i = Y.$$

The variance of $\hat Y_{sy}$ is simply

$$\mathrm{Var}(\hat Y_{sy}) = \frac{N^2}{k}\sum_{i=1}^k (\mu_i - \mu)^2,$$

where $\mu_i = n^{-1}\sum_{t=1}^n y_{i+(t-1)k}$ and $\mu = N^{-1}\sum_{i=1}^N y_i = Y/N$. Let $\sigma^2$ be
given in (3.43) and

$$\sigma_{sy}^2 = \frac{1}{k(n-1)}\sum_{i=1}^k\sum_{t=1}^n (y_{i+(t-1)k} - \mu_i)^2.$$

Then

$$(N-1)\sigma^2 = n\sum_{i=1}^k (\mu_i - \mu)^2 + \sum_{i=1}^k\sum_{t=1}^n (y_{i+(t-1)k} - \mu_i)^2.$$

Thus,

$$(N-1)\sigma^2 = N^{-1}\mathrm{Var}(\hat Y_{sy}) + k(n-1)\sigma_{sy}^2,$$

and

$$\mathrm{Var}(\hat Y_{sy}) = N(N-1)\sigma^2 - N(N-k)\sigma_{sy}^2.$$

Since the variance of the Horvitz-Thompson estimator of the population
total under simple random sampling is, by (3.43),

$$\frac{N^2}{n}\left(1 - \frac{n}{N}\right)\sigma^2 = N(k-1)\sigma^2,$$

the Horvitz-Thompson estimator under systematic sampling has a smaller
variance if and only if $\sigma_{sy}^2 > \sigma^2$.
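Since a systematic design has only $k$ possible samples, every quantity in this derivation can be computed exactly. The sketch below uses an arbitrary illustrative population with $N = 12$ and $k = 3$ (0-based start index $j = 0,\ldots,k-1$ in the code); the values and the function name are assumptions made for the example.

```python
from fractions import Fraction

def systematic_summary(y, k):
    """Exact mean and variance of the systematic-sampling estimator, together
    with sigma^2 from (3.43) and the within-sample variance sigma_sy^2."""
    N = len(y)
    n = N // k
    samples = [[y[j + t * k] for t in range(n)] for j in range(k)]
    estimates = [k * sum(s) for s in samples]  # Y_sy for each start j
    mean_est = Fraction(sum(estimates), k)
    var_sy = sum((e - mean_est) ** 2 for e in estimates) / k
    mu = Fraction(sum(y), N)
    sigma2 = sum((v - mu) ** 2 for v in y) / (N - 1)
    sigma2_sy = sum((v - Fraction(sum(s), n)) ** 2
                    for s in samples for v in s) / (k * (n - 1))
    return mean_est, var_sy, sigma2, sigma2_sy

y = [4, 9, 1, 7, 3, 8, 2, 6, 5, 10, 12, 11]  # illustrative population
k = 3
N = len(y)
mean_est, var_sy, sigma2, sigma2_sy = systematic_summary(y, k)
var_srs = N * (k - 1) * sigma2  # simple random sampling with n = N / k
print(mean_est == sum(y), float(var_sy), float(var_srs))
```

The comparison of `var_sy` with `var_srs` agrees with the sign of $\sigma_{sy}^2 - \sigma^2$, as the last display in the text states.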




3.5 Asymptotically Unbiased Estimators 


As we discussed in §2.5, we often need to consider biased but asymptoti- 
cally unbiased estimators. A large and useful class of such estimators are 
smooth functions of some exactly unbiased estimators such as UMVUE’s, 
U-statistics, LSE’s, and Horvitz-Thompson estimators. Some other meth- 
ods of constructing asymptotically unbiased estimators are also introduced 
in this section. 


3.5.1 Functions of unbiased estimators 


If the parameter to be estimated is $\vartheta = g(\theta)$ with a vector-valued parameter
$\theta$ and $U_n$ is a vector of unbiased estimators of components of $\theta$ (i.e., $EU_n =
\theta$), then $T_n = g(U_n)$ is often asymptotically unbiased for $\vartheta$. Assume that $g$
is differentiable and $c_n(U_n - \theta) \to_d Y$. Then

$$\mathrm{amse}_{T_n}(P) = E\{[\nabla g(\theta)]^\tau Y\}^2/c_n^2$$

(Theorem 2.6). Hence, $T_n$ has a good performance in terms of amse if $U_n$
is optimal in terms of mse (such as the UMVUE).


The following are some examples. 


Example 3.21 (Ratio estimators). Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. random
2-vectors with $EX_1 = \mu_x$ and $EY_1 = \mu_y$. Consider the estimation of
the ratio of two population means: $\vartheta = \mu_y/\mu_x$ ($\mu_x \neq 0$). Note that $(\bar Y, \bar X)$,
the vector of sample means, is unbiased for $(\mu_y, \mu_x)$. The sample means are
UMVUE's under some statistical models (§3.1 and §3.2) and are BLUE's
in general (Example 2.22). The ratio estimator is $T_n = \bar Y/\bar X$. Assume
that $\sigma_x^2 = \mathrm{Var}(X_1)$, $\sigma_y^2 = \mathrm{Var}(Y_1)$, and $\sigma_{xy} = \mathrm{Cov}(X_1,Y_1)$ exist. A direct
calculation shows that the $n^{-1}$ order asymptotic bias of $T_n$ according to
(2.38) is

$$\tilde b_{T_n}(P) = \frac{\vartheta\sigma_x^2 - \sigma_{xy}}{\mu_x^2 n}$$

(verify). Using the CLT and the delta-method (Corollary 1.1), we obtain
that

$$\sqrt{n}(T_n - \vartheta) \to_d N\left(0, \frac{\sigma_y^2 - 2\vartheta\sigma_{xy} + \vartheta^2\sigma_x^2}{\mu_x^2}\right)$$

(verify), which implies

$$\mathrm{amse}_{T_n}(P) = \frac{\sigma_y^2 - 2\vartheta\sigma_{xy} + \vartheta^2\sigma_x^2}{\mu_x^2 n}.$$

In some problems, we are not interested in the ratio, but the use of a
ratio estimator to improve an estimator of a marginal mean. For example,
suppose that $\mu_x$ is known and we are interested in estimating $\mu_y$. Consider
the following estimator:

$$\hat\mu_y = (\bar Y/\bar X)\mu_x.$$

Note that $\hat\mu_y$ is not unbiased; its $n^{-1}$ order asymptotic bias according to
(2.38) is

$$\tilde b_{\hat\mu_y}(P) = \frac{\vartheta\sigma_x^2 - \sigma_{xy}}{\mu_x n}$$

and

$$\mathrm{amse}_{\hat\mu_y}(P) = \frac{\sigma_y^2 - 2\vartheta\sigma_{xy} + \vartheta^2\sigma_x^2}{n}.$$

Comparing $\hat\mu_y$ with the unbiased estimator $\bar Y$, we find that $\hat\mu_y$ is asymptotically
more efficient if and only if

$$2\vartheta\sigma_{xy} > \vartheta^2\sigma_x^2,$$

which means that $\hat\mu_y$ is a better estimator if and only if the correlation
between $X_1$ and $Y_1$ is large enough to pay off the extra variability caused
by using $\mu_x/\bar X$.
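A small simulation can illustrate the efficiency comparison. The sketch below assumes a specific bivariate model chosen only for illustration: $X \sim N(10, 1)$ and $Y = 2X + e$ with $e \sim N(0, 0.25)$ independent of $X$, so that $\vartheta = 2$, $\sigma_x^2 = 1$, $\sigma_y^2 = 4.25$, and $\sigma_{xy} = 2$. The function names, sample size, and number of replications are likewise assumptions.

```python
import random

def amse_ratio_mean(var_x, var_y, cov_xy, theta, n):
    """amse of mu_hat_y = (Y_bar / X_bar) * mu_x from the delta method."""
    return (var_y - 2 * theta * cov_xy + theta ** 2 * var_x) / n

def simulate_mse(n, reps, seed=0):
    """Monte Carlo mse of the ratio-type estimator and of Y_bar (assumed model)."""
    rng = random.Random(seed)
    mu_x, mu_y = 10.0, 20.0
    se_ratio = se_mean = 0.0
    for _ in range(reps):
        xs = [rng.gauss(mu_x, 1.0) for _ in range(n)]
        ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        se_ratio += ((y_bar / x_bar) * mu_x - mu_y) ** 2
        se_mean += (y_bar - mu_y) ** 2
    return se_ratio / reps, se_mean / reps

n = 50
mse_ratio_est, mse_sample_mean = simulate_mse(n, reps=2000)
amse_theory = amse_ratio_mean(1.0, 4.25, 2.0, 2.0, n)  # (4.25 - 8 + 4) / n
print(mse_ratio_est, mse_sample_mean, amse_theory)
```

Here $2\vartheta\sigma_{xy} = 8 > \vartheta^2\sigma_x^2 = 4$, so the condition in the text holds and the simulated mse of $\hat\mu_y$ should fall well below that of $\bar Y$.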


Another example related to a bivariate sample is the sample correlation 
coefficient defined in Exercise 22 in §2.6. 


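Example 3.22. Consider a polynomial regression of order $p$:

$$X_i = \beta^\tau Z_i + \varepsilon_i, \quad i = 1,\ldots,n,$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})$, $Z_i = (1, t_i, \ldots, t_i^{p-1})$, and $\varepsilon_i$'s are i.i.d. with
mean 0 and variance $\sigma^2 > 0$. Suppose that the parameter to be estimated
is $t_\beta \in \mathcal{T} \subset \mathcal{R}$ such that

$$\sum_{j=0}^{p-1} \beta_j t_\beta^j = \max_{t\in\mathcal{T}} \sum_{j=0}^{p-1} \beta_j t^j.$$

Note that $t_\beta = g(\beta)$ for some function $g$. Let $\hat\beta$ be the LSE of $\beta$. Then the
estimator $\hat t_\beta = g(\hat\beta)$ is asymptotically unbiased and its amse can be derived
under some conditions (Exercise 98). ∎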


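Example 3.23. In the study of the reliability of a system component, we
assume that

$$X_{ij} = \theta_i^\tau z(t_j) + \varepsilon_{ij}, \quad i = 1,\ldots,k,\ j = 1,\ldots,m.$$

Here $X_{ij}$ is the measurement of the $i$th sample component at time $t_j$; $z(t)$
is a $q$-vector whose components are known functions of the time $t$; $\theta_i$'s
are unobservable random $q$-vectors that are i.i.d. from $N_q(\theta, \Sigma)$, where $\theta$
and $\Sigma$ are unknown; $\varepsilon_{ij}$'s are i.i.d. measurement errors with mean zero
and variance $\sigma^2$; and $\theta_i$'s and $\varepsilon_{ij}$'s are independent. As a function of $t$,
$\theta_i^\tau z(t)$ is the degradation curve for a particular component and $\theta^\tau z(t)$ is
the mean degradation curve. Suppose that a component will fail to work if
$\theta_i^\tau z(t) < \eta$, a given critical value. Assume that $\theta_i^\tau z(t)$ is always a decreasing
function of $t$. Then the reliability function of a component is

$$R(t) = P(\theta_i^\tau z(t) > \eta) = \Phi\left(\frac{\theta^\tau z(t) - \eta}{s(t)}\right),$$

where $s(t) = \sqrt{[z(t)]^\tau \Sigma z(t)}$ and $\Phi$ is the standard normal distribution
function. For a fixed $t$, estimators of $R(t)$ can be obtained by estimating
$\theta$ and $\Sigma$, since $\Phi$ is a known function. It can be shown (exercise) that the
BLUE of $\theta$ is the LSE

$$\hat\theta = (Z^\tau Z)^{-1}Z^\tau \bar X,$$

where $Z$ is the $m \times q$ matrix whose $j$th row is the vector $z(t_j)$, $X_i =
(X_{i1},\ldots,X_{im})$, and $\bar X$ is the sample mean of $X_i$'s. The estimation of $\Sigma$ is
more difficult. It can be shown (exercise) that a consistent (as $k \to \infty$)
estimator of $\Sigma$ is

$$\hat\Sigma = \frac{1}{k}\sum_{i=1}^k (Z^\tau Z)^{-1}Z^\tau (X_i - \bar X)(X_i - \bar X)^\tau Z(Z^\tau Z)^{-1} - \hat\sigma^2 (Z^\tau Z)^{-1},$$

where

$$\hat\sigma^2 = \frac{1}{k(m-q)}\sum_{i=1}^k [X_i^\tau X_i - X_i^\tau Z(Z^\tau Z)^{-1}Z^\tau X_i].$$

Hence an estimator of $R(t)$ is

$$\hat R(t) = \Phi\left(\frac{\hat\theta^\tau z(t) - \eta}{\hat s(t)}\right),$$

where

$$\hat s(t) = \sqrt{[z(t)]^\tau \hat\Sigma z(t)}.$$

If we define $Y_{i1} = X_i^\tau Z(Z^\tau Z)^{-1}z(t)$, $Y_{i2} = [X_i^\tau Z(Z^\tau Z)^{-1}z(t)]^2$, $Y_{i3} =
[X_i^\tau X_i - X_i^\tau Z(Z^\tau Z)^{-1}Z^\tau X_i]/(m-q)$, and $Y_i = (Y_{i1}, Y_{i2}, Y_{i3})$, then it is
apparent that $\hat R(t)$ can be written as $g(\bar Y)$ for a function

$$g(y_1, y_2, y_3) = \Phi\left(\frac{y_1 - \eta}{\sqrt{y_2 - y_1^2 - y_3[z(t)]^\tau (Z^\tau Z)^{-1}z(t)}}\right).$$

Suppose that $\varepsilon_{ij}$ has a finite fourth moment, which implies the existence of
$\mathrm{Var}(Y_i)$. The amse of $\hat R(t)$ can be derived (exercise). ∎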




3.5.2 The method of moments 


The method of moments is the oldest method of deriving point estima- 
tors. It almost always produces some asymptotically unbiased estimators, 
although they may not be the best estimators. 

Consider a parametric problem where $X_1,\ldots,X_n$ are i.i.d. random variables
from $P_\theta$, $\theta \in \Theta \subset \mathcal{R}^k$, and $E|X_1|^k < \infty$. Let $\mu_j = EX_1^j$ be the $j$th
moment of $P$ and let

$$\hat\mu_j = \frac{1}{n}\sum_{i=1}^n X_i^j$$

be the $j$th sample moment, which is an unbiased estimator of $\mu_j$, $j = 1,\ldots,k$.
Typically,

$$\mu_j = h_j(\theta), \quad j = 1,\ldots,k, \tag{3.51}$$

for some functions $h_j$ on $\mathcal{R}^k$. By substituting $\mu_j$'s on the left-hand side of
(3.51) by the sample moments $\hat\mu_j$, we obtain a moment estimator $\hat\theta$, i.e., $\hat\theta$
satisfies

$$\hat\mu_j = h_j(\hat\theta), \quad j = 1,\ldots,k,$$

which is a sample analogue of (3.51). This method of deriving estimators is
called the method of moments. Note that an important statistical principle,
the substitution principle, is applied in this method.

Let $\hat\mu = (\hat\mu_1,\ldots,\hat\mu_k)$ and $h = (h_1,\ldots,h_k)$. Then $\hat\mu = h(\hat\theta)$. If the inverse
function $h^{-1}$ exists, then the unique moment estimator of $\theta$ is $\hat\theta = h^{-1}(\hat\mu)$.
When $h^{-1}$ does not exist (i.e., $h$ is not one-to-one), any solution of $\hat\mu = h(\hat\theta)$
is a moment estimator of $\theta$; if possible, we always choose a solution $\hat\theta$ in the
parameter space $\Theta$. In some cases, however, a moment estimator does not
exist (see Exercise 111).

Assume that $\hat\theta = g(\hat\mu)$ for a function $g$. If $h^{-1}$ exists, then $g = h^{-1}$. If
$g$ is continuous at $\mu = (\mu_1,\ldots,\mu_k)$, then $\hat\theta$ is strongly consistent for $\theta$, since
$\hat\mu_j \to_{a.s.} \mu_j$ by the SLLN. If $g$ is differentiable at $\mu$ and $E|X_1|^{2k} < \infty$, then
$\hat\theta$ is asymptotically normal, by the CLT and Theorem 1.12, and

$$\mathrm{amse}_{\hat\theta}(\theta) = n^{-1}[\nabla g(\mu)]^\tau V_\mu \nabla g(\mu),$$

where $V_\mu$ is a $k \times k$ matrix whose $(i,j)$th element is $\mu_{i+j} - \mu_i\mu_j$. Furthermore,
it follows from (2.38) that the $n^{-1}$ order asymptotic bias of $\hat\theta$
is

$$(2n)^{-1}\mathrm{tr}\left(\nabla^2 g(\mu) V_\mu\right).$$

Example 3.24. Let $X_1,\ldots,X_n$ be i.i.d. from a population $P_\theta$ indexed by
the parameter $\theta = (\mu, \sigma^2)$, where $\mu = EX_1 \in \mathcal{R}$ and $\sigma^2 = \mathrm{Var}(X_1) \in
(0,\infty)$. This includes cases such as the family of normal distributions,
double exponential distributions, or logistic distributions (Table 1.2, page
20). Since $EX_1 = \mu$ and $EX_1^2 = \mathrm{Var}(X_1) + (EX_1)^2 = \sigma^2 + \mu^2$, setting
$\hat\mu_1 = \mu$ and $\hat\mu_2 = \sigma^2 + \mu^2$ we obtain the moment estimator

$$\hat\theta = \left(\bar X, \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2\right) = \left(\bar X, \frac{n-1}{n}S^2\right).$$

Note that $\bar X$ is unbiased, but $\frac{n-1}{n}S^2$ is not. If $X_i$ is normal, then $\hat\theta$ is sufficient
and is nearly the same as an optimal estimator such as the UMVUE.
On the other hand, if $X_i$ is from a double exponential or logistic distribution,
then $\hat\theta$ is not sufficient and can often be improved.

Consider now the estimation of $\sigma^2$ when we know that $\mu = 0$. Obviously
we cannot use the equation $\hat\mu_1 = \mu$ to solve the problem. Using $\hat\mu_2 = \mu_2 =
\sigma^2$, we obtain the moment estimator $\hat\sigma^2 = \hat\mu_2 = n^{-1}\sum_{i=1}^n X_i^2$. This is
still a good estimator when $X_i$ is normal, but is not a function of sufficient
statistic when $X_i$ is from a double exponential distribution. For the double
exponential case one can argue that we should first make a transformation
$Y_i = |X_i|$ and then obtain the moment estimator based on the transformed
data. The moment estimator of $\sigma^2$ based on the transformed data is $\bar Y^2 =
(n^{-1}\sum_{i=1}^n |X_i|)^2$, which is sufficient for $\sigma^2$. Note that this estimator can
also be obtained based on absolute moment equations. ∎


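Example 3.25. Let $X_1,\ldots,X_n$ be i.i.d. from the uniform distribution on
$(\theta_1, \theta_2)$, $-\infty < \theta_1 < \theta_2 < \infty$. Note that

$$EX_1 = (\theta_1 + \theta_2)/2$$

and

$$EX_1^2 = (\theta_1^2 + \theta_2^2 + \theta_1\theta_2)/3.$$

Setting $\hat\mu_1 = EX_1$ and $\hat\mu_2 = EX_1^2$ and substituting $\theta_1$ in the second equation
by $2\hat\mu_1 - \theta_2$ (the first equation), we obtain that

$$(2\hat\mu_1 - \theta_2)^2 + \theta_2^2 + (2\hat\mu_1 - \theta_2)\theta_2 = 3\hat\mu_2,$$

which is the same as

$$(\theta_2 - \hat\mu_1)^2 = 3(\hat\mu_2 - \hat\mu_1^2).$$

Since $\theta_2 > EX_1$, we obtain that

$$\hat\theta_2 = \hat\mu_1 + \sqrt{3(\hat\mu_2 - \hat\mu_1^2)} = \bar X + \sqrt{\tfrac{3(n-1)}{n}S^2}$$

and

$$\hat\theta_1 = \hat\mu_1 - \sqrt{3(\hat\mu_2 - \hat\mu_1^2)} = \bar X - \sqrt{\tfrac{3(n-1)}{n}S^2}.$$

These estimators are not functions of the sufficient and complete statistic
$(X_{(1)}, X_{(n)})$. ∎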
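The closed-form moment estimators for the uniform family are easy to compute directly. The following sketch implements them; the function name and the two small data sets are illustrative. The second data set shows a consequence of not using $(X_{(1)}, X_{(n)})$: the estimated upper endpoint can fall below the largest observation.

```python
import math

def uniform_moment_estimates(xs):
    """Moment estimators (theta1_hat, theta2_hat) for the uniform(theta1, theta2) family."""
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    half_width = math.sqrt(3.0 * (m2 - m1 * m1))
    return m1 - half_width, m1 + half_width

xs = [0.1, 0.5, 0.9, 0.3, 0.7]
theta1_hat, theta2_hat = uniform_moment_estimates(xs)
print(theta1_hat, theta2_hat)

# A skewed sample: the estimated upper endpoint is smaller than max(xs).
xs_skewed = [0.0, 0.0, 0.0, 0.0, 1.0]
lower_skewed, upper_skewed = uniform_moment_estimates(xs_skewed)
print(upper_skewed, max(xs_skewed))
```

The estimates always satisfy $(\hat\theta_1 + \hat\theta_2)/2 = \bar X$, which is the first moment equation.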


Example 3.26. Let $X_1,\ldots,X_n$ be i.i.d. from the binomial distribution
$Bi(p,k)$ with unknown parameters $k \in \{1,2,\ldots\}$ and $p \in (0,1)$. Since

$$EX_1 = kp$$

and

$$EX_1^2 = kp(1-p) + k^2p^2,$$

we obtain the moment estimators

$$\hat p = (\hat\mu_1 + \hat\mu_1^2 - \hat\mu_2)/\hat\mu_1 = 1 - \tfrac{n-1}{n}S^2/\bar X$$

and

$$\hat k = \hat\mu_1^2/(\hat\mu_1 + \hat\mu_1^2 - \hat\mu_2) = \bar X/\left(1 - \tfrac{n-1}{n}S^2/\bar X\right).$$

The estimator $\hat p$ is in the range of $(0,1)$. But $\hat k$ may not be an integer. It
can be improved by an estimator that is $\hat k$ rounded to the nearest positive
integer. ∎


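Example 3.27. Suppose that $X_1,\ldots,X_n$ are i.i.d. from the Pareto distribution
$Pa(a,\theta)$ with unknown $a > 0$ and $\theta > 2$ (Table 1.2, page 20). Note
that

$$EX_1 = \theta a/(\theta - 1)$$

and

$$EX_1^2 = \theta a^2/(\theta - 2).$$

From the moment equation,

$$\frac{(\theta-1)^2}{\theta(\theta-2)} = \hat\mu_2/\hat\mu_1^2.$$

Note that $\frac{(\theta-1)^2}{\theta(\theta-2)} - 1 = \frac{1}{\theta(\theta-2)}$. Hence

$$\theta(\theta - 2) = \hat\mu_1^2/(\hat\mu_2 - \hat\mu_1^2).$$

Since $\theta > 2$, there is a unique solution in the parameter space:

$$\hat\theta = 1 + \sqrt{\hat\mu_2/(\hat\mu_2 - \hat\mu_1^2)} = 1 + \sqrt{1 + \tfrac{n}{n-1}\bar X^2/S^2}$$

and

$$\hat a = \frac{\hat\mu_1(\hat\theta - 1)}{\hat\theta} = \bar X\sqrt{1 + \tfrac{n}{n-1}\bar X^2/S^2}\Big/\left(1 + \sqrt{1 + \tfrac{n}{n-1}\bar X^2/S^2}\right).$$ ∎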




The method of moments can also be applied to nonparametric problems. 
Consider, for example, the estimation of the central moments 


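$$c_j = E(X_1 - \mu_1)^j, \quad j = 2,\ldots,k.$$

Since

$$c_j = \sum_{t=0}^j \binom{j}{t}(-\mu_1)^t\mu_{j-t},$$

the moment estimator of $c_j$ is

$$\hat c_j = \sum_{t=0}^j \binom{j}{t}(-\bar X)^t\hat\mu_{j-t},$$

where $\hat\mu_0 = 1$. It can be shown (exercise) that

$$\hat c_j = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^j, \quad j = 2,\ldots,k, \tag{3.52}$$

which are sample central moments. From the SLLN, $\hat c_j$'s are strongly consistent.
If $E|X_1|^{2k} < \infty$, then

$$\sqrt{n}(\hat c_2 - c_2, \ldots, \hat c_k - c_k) \to_d N_{k-1}(0, D) \tag{3.53}$$

(exercise), where the $(i,j)$th element of the $(k-1) \times (k-1)$ matrix $D$ is

$$c_{i+j+2} - c_{i+1}c_{j+1} - (i+1)c_ic_{j+2} - (j+1)c_{i+2}c_j + (i+1)(j+1)c_ic_jc_2.$$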


3.5.3 V-statistics 


Let $X_1,\ldots,X_n$ be i.i.d. from $P$. For every U-statistic $U_n$ defined in (3.11) as
an estimator of $\vartheta = E[h(X_1,\ldots,X_m)]$, there is a closely related V-statistic
defined by

$$V_n = \frac{1}{n^m}\sum_{i_1=1}^n \cdots \sum_{i_m=1}^n h(X_{i_1},\ldots,X_{i_m}). \tag{3.54}$$

As an estimator of $\vartheta$, $V_n$ is biased; but the bias is small asymptotically as
the following results show. For a fixed sample size $n$, $V_n$ may be better than
$U_n$ in terms of their mse's. Consider, for example, the kernel $h(x_1, x_2) =
(x_1 - x_2)^2/2$ in §3.2.1, which leads to $\vartheta = \sigma^2 = \mathrm{Var}(X_1)$ and $U_n = S^2$, the
sample variance. The corresponding V-statistic is

$$\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{(X_i - X_j)^2}{2} = \frac{1}{n^2}\sum_{1\le i<j\le n} (X_i - X_j)^2 = \frac{n-1}{n}S^2,$$

which is the moment estimator of $\sigma^2$ discussed in Example 3.24. In Exercise
63 in §2.6, $\frac{n-1}{n}S^2$ is shown to have a smaller mse than $S^2$ when $X_i$ is
normally distributed. Of course, there are situations where U-statistics are
better than their corresponding V-statistics.
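The relation between the two statistics for the variance kernel can be confirmed by brute force. The sketch below averages the kernel $h(x_1,x_2) = (x_1-x_2)^2/2$ over distinct pairs (U-statistic) and over all ordered pairs including $i = j$ (V-statistic); the data and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

def kernel(x1, x2):
    """h(x1, x2) = (x1 - x2)^2 / 2, whose expectation is Var(X_1)."""
    return Fraction((x1 - x2) ** 2, 2)

def u_statistic(xs):
    """Average of the kernel over all pairs i < j."""
    pairs = list(combinations(xs, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

def v_statistic(xs):
    """Average of the kernel over all n^2 ordered pairs, diagonal included."""
    n = len(xs)
    return sum(kernel(a, b) for a, b in product(xs, repeat=2)) / n ** 2

xs = [2, 4, 4, 5, 7, 9]  # illustrative data
n = len(xs)
u_n, v_n = u_statistic(xs), v_statistic(xs)
print(u_n, v_n, v_n == Fraction(n - 1, n) * u_n)
```

The diagonal terms $h(X_i, X_i) = 0$ contribute nothing to the sum but still count in the divisor $n^2$, which is where the factor $(n-1)/n$ comes from.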

The following result provides orders of magnitude of the bias and variance
of a V-statistic as an estimator of $\vartheta$.

Proposition 3.5. Let $V_n$ be defined by (3.54).
(i) Assume that $E|h(X_{i_1},\ldots,X_{i_m})| < \infty$ for all $1 \le i_1 \le \cdots \le i_m \le m$.
Then the bias of $V_n$ satisfies

$$b_{V_n}(P) = O(n^{-1}).$$

(ii) Assume that $E[h(X_{i_1},\ldots,X_{i_m})]^2 < \infty$ for all $1 \le i_1 \le \cdots \le i_m \le m$.
Then the variance of $V_n$ satisfies

$$\mathrm{Var}(V_n) = \mathrm{Var}(U_n) + O(n^{-2}),$$

where $U_n$ is given by (3.11).
Proof. (i) Note that

$$U_n - V_n = \left[1 - \frac{n!}{n^m(n-m)!}\right](U_n - W_n), \tag{3.55}$$

where $W_n$ is the average of all terms $h(X_{i_1},\ldots,X_{i_m})$ with at least one equality
$i_k = i_l$, $k \neq l$. The result follows from $E(U_n - W_n) = O(1)$.

(ii) The result follows from $E(U_n - W_n)^2 = O(1)$, $E[W_n(U_n - \vartheta)] = O(n^{-1})$
(exercise), and (3.55). ∎


To study the asymptotic behavior of a V-statistic, we consider the following
representation of $V_n$ in (3.54):

$$V_n - \vartheta = \sum_{j=1}^m \binom{m}{j}V_{nj},$$

where

$$V_{nj} = \frac{1}{n^j}\sum_{i_1=1}^n \cdots \sum_{i_j=1}^n g_j(X_{i_1},\ldots,X_{i_j})$$

is a "V-statistic" with

$$g_j(x_1,\ldots,x_j) = h_j(x_1,\ldots,x_j) - \sum_{i=1}^j \int h_j(x_1,\ldots,x_j)\,dP(x_i)
+ \sum_{1\le i_1<i_2\le j} \int\!\!\int h_j(x_1,\ldots,x_j)\,dP(x_{i_1})\,dP(x_{i_2}) - \cdots
+ (-1)^j \int\!\cdots\!\int h_j(x_1,\ldots,x_j)\,dP(x_1)\cdots dP(x_j)$$

and $h_j(x_1,\ldots,x_j) = E[h(x_1,\ldots,x_j,X_{j+1},\ldots,X_m)]$. Using an argument similar
to the proof of Theorem 3.4, we can show (exercise) that

$$EV_{nj}^2 = O(n^{-j}), \quad j = 1,\ldots,m, \tag{3.56}$$

provided that $E[h(X_{i_1},\ldots,X_{i_m})]^2 < \infty$ for all $1 \le i_1 \le \cdots \le i_m \le m$.
Thus,

$$V_n - \vartheta = mV_{n1} + \frac{m(m-1)}{2}V_{n2} + o_p(n^{-1}), \tag{3.57}$$

which leads to the following result similar to Theorem 3.5.
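Theorem 3.16. Let $V_n$ be given by (3.54) with $E[h(X_{i_1},\ldots,X_{i_m})]^2 < \infty$
for all $1 \le i_1 \le \cdots \le i_m \le m$.
(i) If $\zeta_1 = \mathrm{Var}(h_1(X_1)) > 0$, then

$$\sqrt{n}(V_n - \vartheta) \to_d N(0, m^2\zeta_1).$$

(ii) If $\zeta_1 = 0$ but $\zeta_2 = \mathrm{Var}(h_2(X_1, X_2)) > 0$, then

$$n(V_n - \vartheta) \to_d \frac{m(m-1)}{2}\sum_{j=1}^\infty \lambda_j\chi_{1j}^2,$$

where $\chi_{1j}^2$'s and $\lambda_j$'s are the same as those in (3.21).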


Result (3.57) and Theorem 3.16 imply that $V_n$ has expansion (2.37)
and, therefore, the $n^{-1}$ order asymptotic bias of $V_n$ is $\binom{m}{2}E[g_2(X_1, X_1)]/n =
\binom{m}{2}EV_{n2} = m(m-1)\sum_{j=1}^\infty \lambda_j/(2n)$ (exercise).

Theorem 3.16 shows that if $\zeta_1 > 0$, then the amse's of $U_n$ and $V_n$ are
the same. If $\zeta_1 = 0$ but $\zeta_2 > 0$, then an argument similar to that in the
proof of Lemma 3.2 leads to

$$\mathrm{amse}_{V_n}(P) = \frac{m^2(m-1)^2\zeta_2}{2n^2} + \frac{m^2(m-1)^2}{4n^2}\left(\sum_{j=1}^\infty \lambda_j\right)^2
= \mathrm{amse}_{U_n}(P) + \frac{m^2(m-1)^2}{4n^2}\left(\sum_{j=1}^\infty \lambda_j\right)^2$$

(see Lemma 3.2). Hence $U_n$ is asymptotically more efficient than $V_n$, unless
$\sum_{j=1}^\infty \lambda_j = 0$. Technically, the proof of the asymptotic results for $V_n$ also
requires moment conditions stronger than those for $U_n$.


Example 3.28. Consider the estimation of $\mu^2$, where $\mu = EX_1$. From the
results in §3.2, the U-statistic $U_n = \frac{2}{n(n-1)}\sum_{1\le i<j\le n} X_iX_j$ is unbiased for
$\mu^2$. The corresponding V-statistic is simply $V_n = \bar X^2$. If $\mu \neq 0$, then $\zeta_1 \neq 0$
and the asymptotic relative efficiency of $V_n$ w.r.t. $U_n$ is 1. If $\mu = 0$, then

$$nV_n \to_d \sigma^2\chi_1^2 \quad\text{and}\quad nU_n \to_d \sigma^2(\chi_1^2 - 1),$$

where $\chi_1^2$ is a random variable having the chi-square distribution $\chi_1^2$. Hence
the asymptotic relative efficiency of $V_n$ w.r.t. $U_n$ is

$$E(\chi_1^2 - 1)^2/E(\chi_1^2)^2 = 2/3.$$ ∎


3.5.4 The weighted LSE 


In linear model (3.25), the unbiased LSE of $l^\tau\beta$ may be improved by a slightly biased estimator when $\mathrm{Var}(\varepsilon)$ is not $\sigma^2I_n$ and the LSE is not BLUE.

Assume that $Z$ in (3.25) is of full rank so that every $l^\tau\beta$ is estimable. For simplicity, let us denote $\mathrm{Var}(\varepsilon)$ by $V$. If $V$ is known, then the BLUE of $l^\tau\beta$ is $l^\tau\breve\beta$, where
\[ \breve\beta = (Z^\tau V^{-1}Z)^{-1}Z^\tau V^{-1}X \qquad (3.58) \]
(see the discussion after the statement of assumption A3 in §3.3.1). If $V$ is unknown and $\hat V$ is an estimator of $V$, then an application of the substitution principle leads to a weighted least squares estimator
\[ \hat\beta_w = (Z^\tau\hat V^{-1}Z)^{-1}Z^\tau\hat V^{-1}X. \qquad (3.59) \]
The weighted LSE is not linear in $X$ and not necessarily unbiased for $\beta$. If the distribution of $\varepsilon$ is symmetric about 0 and $\hat V$ remains unchanged when $\varepsilon$ changes to $-\varepsilon$ (Examples 3.29 and 3.30), then the distribution of $\hat\beta_w - \beta$ is symmetric about 0 and, if $E\hat\beta_w$ is well defined, $\hat\beta_w$ is unbiased for $\beta$. In such a case the LSE $l^\tau\hat\beta$ may not be a UMVUE (when $\varepsilon$ is normal), since $\mathrm{Var}(l^\tau\hat\beta_w)$ may be smaller than $\mathrm{Var}(l^\tau\hat\beta)$.
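The weighted LSE is straightforward to compute once an estimate of $V$ is available. The following is a minimal sketch under a simplifying assumption that is not part of the text: the estimated covariance matrix is diagonal, so the weights are just reciprocals of estimated variances. The helper names `solve` and `weighted_lse` and the data are hypothetical.

```python
def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def weighted_lse(Z, x, v_hat):
    """(Z' W Z)^{-1} Z' W x with W = diag(1 / v_hat): the weighted LSE for a diagonal V-hat."""
    p = len(Z[0])
    w = [1.0 / v for v in v_hat]
    A = [[sum(wi * zi[j] * zi[k] for wi, zi in zip(w, Z)) for k in range(p)]
         for j in range(p)]
    b = [sum(wi * zi[j] * xi for wi, zi, xi in zip(w, Z, x)) for j in range(p)]
    return solve(A, b)

# Hypothetical design: intercept plus one covariate, noise-free responses.
ts = (0.0, 1.0, 2.0, 3.0)
Z = [[1.0, t] for t in ts]
x = [2.0 + 3.0 * t for t in ts]
beta_w = weighted_lse(Z, x, v_hat=[1.0, 2.0, 4.0, 8.0])
beta_ols = weighted_lse(Z, x, v_hat=[1.0] * 4)   # equal weights give the ordinary LSE
```

With noise-free responses both calls recover the coefficients (2, 3) up to rounding; with noisy data the two estimates differ, and the weighted one downweights observations with large estimated variance.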

Asymptotic properties of the weighted LSE depend on the asymptotic behavior of $\hat V$. We say that $\hat V$ is consistent for $V$ if and only if
\[ \|\hat V^{-1}V - I_n\|_{\max} \to_p 0, \qquad (3.60) \]
where $\|A\|_{\max} = \max_{i,j}|a_{ij}|$ for a matrix $A$ whose $(i,j)$th element is $a_{ij}$.
Theorem 3.17. Consider model (3.25) with a full rank $Z$. Let $\breve\beta$ and $\hat\beta_w$ be defined by (3.58) and (3.59), respectively, with a $\hat V$ consistent in the sense of (3.60). Assume the conditions in Theorem 3.12. Then
\[ l^\tau(\hat\beta_w - \beta)/a_n \to_d N(0,1), \]
where $l \in \mathcal{R}^p$, $l \ne 0$, and
\[ a_n^2 = \mathrm{Var}(l^\tau\breve\beta) = l^\tau(Z^\tau V^{-1}Z)^{-1}l. \]


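Proof. Using the same argument as in the proof of Theorem 3.12, we obtain that
\[ l^\tau(\breve\beta - \beta)/a_n \to_d N(0,1). \]
By Slutsky's theorem, the result follows from
\[ l^\tau\hat\beta_w - l^\tau\breve\beta = o_p(a_n). \qquad (3.61) \]
Define
\[ \xi_n = l^\tau(Z^\tau\hat V^{-1}Z)^{-1}Z^\tau(\hat V^{-1} - V^{-1})\varepsilon \]
and
\[ \zeta_n = l^\tau[(Z^\tau\hat V^{-1}Z)^{-1} - (Z^\tau V^{-1}Z)^{-1}]Z^\tau V^{-1}\varepsilon. \]
Then
\[ l^\tau\hat\beta_w - l^\tau\breve\beta = \xi_n + \zeta_n. \]
Let $B_n = (Z^\tau\hat V^{-1}Z)^{-1}Z^\tau V^{-1}Z - I_p$ and $C_n = \hat V^{1/2}V^{-1}\hat V^{1/2} - I_n$. By (3.60), $\|C_n\|_{\max} = o_p(1)$. For any matrix $A$, denote $\sqrt{\mathrm{tr}(A^\tau A)}$ by $\|A\|$. Then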
Then 


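\[ \begin{aligned}
\|B_n\|^2 &= \|(Z^\tau\hat V^{-1}Z)^{-1}Z^\tau\hat V^{-1/2}C_n\hat V^{-1/2}Z\|^2 \\
&= \mathrm{tr}\big((Z^\tau\hat V^{-1}Z)^{-1}(Z^\tau\hat V^{-1/2}C_n\hat V^{-1/2}Z)^2(Z^\tau\hat V^{-1}Z)^{-1}\big) \\
&\le \|C_n\|_{\max}^2\,\mathrm{tr}\big((Z^\tau\hat V^{-1}Z)^{-1}(Z^\tau\hat V^{-1}Z)^2(Z^\tau\hat V^{-1}Z)^{-1}\big) \\
&= o_p(1)\,\mathrm{tr}(I_p).
\end{aligned} \]
This proves that $\|B_n\|_{\max} = o_p(1)$. Let $A_n = V^{1/2}\hat V^{-1}V^{1/2} - I_n$. Using inequality (1.37) and the previous results, we obtain that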


é2 = [(I"(Z7V 12) lyty 1/24 V 1/26)? 

EVD VA) a ae ete) 
OIA AVIA aaa 

Op(1)I" (Bn + Ip)?(Z7V"*Z) 11 

Op (Gn). 

Since $E\|(Z^\tau V^{-1}Z)^{-1/2}Z^\tau V^{-1}\varepsilon\|^2 = p$, $\|(Z^\tau V^{-1}Z)^{-1/2}Z^\tau V^{-1}\varepsilon\| = O_p(1)$.
Define Bin = (Z7V~!Z)"/2B,(Z7V-!Z)-1/2, Then 


IN IA 


l| 


I 


Bin = Ah ish ig a) (cect 0 Va eA A aa A ee 
S lCnhkiwtA VAI Qe VHA ea Ve 
= 0p(1)Jp. 




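Let $B_{2n} = (Z^\tau V^{-1}Z)^{1/2}(Z^\tau\hat V^{-1}Z)^{-1/2}$. Since
\[ \begin{aligned}
\|B_{2n}\|^2 &= \mathrm{tr}\big((Z^\tau V^{-1}Z)^{1/2}(Z^\tau\hat V^{-1}Z)^{-1}(Z^\tau V^{-1}Z)^{1/2}\big) \\
&= \mathrm{tr}\big((Z^\tau\hat V^{-1}Z)^{-1}Z^\tau V^{-1}Z\big) \\
&= \mathrm{tr}(B_n + I_p) \\
&= p + o_p(1),
\end{aligned} \]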


we obtain that 
|| Ban Bin Bo, || = Op(1). 


Then 
@=(IB oe ea ts lyty—le ig 
=z" 2) VR BE Belg yo lee ae 
SU(ZV AZ) || Bon Bin B5q "(ZV 12) P27 Ve |? 
4 


This proves (3.61) and thus completes the proof.


Theorem 3.17 shows that as long as $\hat V$ is consistent in the sense of (3.60), the weighted LSE $\hat\beta_w$ is asymptotically as efficient as $\breve\beta$, which is the BLUE if $V$ is known. If $V$ is known and $\varepsilon$ is normal, then $\mathrm{Var}(l^\tau\breve\beta)$ attains the Cramér-Rao lower bound (Proposition 3.2) and, thus, (3.10) holds with $T_n = l^\tau\hat\beta_w$.

By Theorems 3.12 and 3.17, the asymptotic relative efficiency of the LSE $l^\tau\hat\beta$ w.r.t. the weighted LSE $l^\tau\hat\beta_w$ is
\[ \frac{l^\tau(Z^\tau V^{-1}Z)^{-1}l}{l^\tau(Z^\tau Z)^{-1}Z^\tau VZ(Z^\tau Z)^{-1}l}, \]
which never exceeds 1 and equals 1 if $l^\tau\hat\beta$ is a BLUE (in which case $\hat\beta = \breve\beta$).
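For a concrete feel for this ratio, it can be evaluated in the simplest case. The sketch below assumes a single regressor ($p = 1$, $l = 1$) and a diagonal $V$, neither of which is required by the theorem; the function name `are_lse_wrt_weighted` and the numbers are hypothetical.

```python
def are_lse_wrt_weighted(z, v):
    """Ratio l'(Z'V^{-1}Z)^{-1}l / l'(Z'Z)^{-1}Z'VZ(Z'Z)^{-1}l for p = 1, l = 1, V = diag(v)."""
    szz = sum(zi * zi for zi in z)
    a2 = 1.0 / sum(zi * zi / vi for zi, vi in zip(z, v))             # variance of the BLUE
    var_lse = sum(zi * zi * vi for zi, vi in zip(z, v)) / szz ** 2   # variance of the LSE
    return a2 / var_lse

z = [1.0, 2.0, 3.0, 4.0]
are_equal_var = are_lse_wrt_weighted(z, [2.0, 2.0, 2.0, 2.0])   # homoscedastic: LSE is BLUE
are_hetero = are_lse_wrt_weighted(z, [1.0, 2.0, 3.0, 4.0])      # variance proportional to z
```

In the homoscedastic case the ratio is 1; in the heteroscedastic case it is about 0.9, so the LSE loses roughly 10% efficiency relative to the weighted LSE. That the ratio cannot exceed 1 in this scalar case is the Cauchy-Schwarz inequality.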
Finding a consistent $\hat V$ is possible when $V$ has a certain type of structure. We consider three examples.


Example 3.29. Consider model (3.25). Suppose that $V = \mathrm{Var}(\varepsilon)$ is a block diagonal matrix with the $i$th diagonal block
\[ \sigma^2I_{m_i} + U_i\Sigma U_i^\tau, \qquad i = 1,\ldots,k, \qquad (3.62) \]
where $m_i$'s are integers bounded by a fixed integer $m$, $\sigma^2 > 0$ is an unknown parameter, $\Sigma$ is a $q\times q$ unknown nonnegative definite matrix, $U_i$ is an $m_i\times q$ full rank matrix whose columns are in $\mathcal{R}(W_i)$, $q < \inf_i m_i$, and $W_i$ is the $p\times m_i$ matrix such that $Z^\tau = (W_1\ W_2\ \ldots\ W_k)$. Under (3.62), a consistent $\hat V$ can be obtained if we can obtain consistent estimators of $\sigma^2$ and $\Sigma$.

Let $X = (Y_1,\ldots,Y_k)$, where $Y_i$ is an $m_i$-vector, and let $R_i$ be the matrix whose columns are linearly independent rows of $W_i$. Then


= 


k 

1 

Y."[Im, — R;(RTR;)~' RT; . 
wages Um: — Ri( Ri Ri)" 7] (3.63) 


is an unbiased estimator of $\sigma^2$. Assume that $Y_i$'s are independent and that $\sup_i E|\varepsilon_i|^{2+\delta} < \infty$ for some $\delta > 0$. Then $\hat\sigma^2$ is consistent for $\sigma^2$ (exercise). Let $r_i = Y_i - W_i^\tau\hat\beta$ and


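\[ \hat\Sigma = \frac{1}{k}\sum_{i=1}^k\big[(U_i^\tau U_i)^{-1}U_i^\tau r_ir_i^\tau U_i(U_i^\tau U_i)^{-1} - \hat\sigma^2(U_i^\tau U_i)^{-1}\big]. \qquad (3.64) \]
It can be shown (exercise) that $\hat\Sigma$ is consistent for $\Sigma$ in the sense that $\|\hat\Sigma - \Sigma\|_{\max} \to_p 0$ or, equivalently, $\|\hat\Sigma - \Sigma\| \to_p 0$ (see Exercise 116).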


Example 3.30. Suppose that $V$ is a block diagonal matrix with the $i$th diagonal block matrix $V_{m_i}$, $i = 1,\ldots,k$, where $V_t$ is an unknown $t\times t$ matrix and $m_i \in \{1,\ldots,m\}$ with a fixed positive integer $m$. Thus, we need to obtain consistent estimators of at most $m$ different matrices $V_1,\ldots,V_m$. It can be shown (exercise) that the following estimator is consistent for $V_t$ when $k_t \to \infty$ as $k \to \infty$:
\[ \hat V_t = \frac{1}{k_t}\sum_{i\in B_t}r_ir_i^\tau, \qquad t = 1,\ldots,m, \]
where $r_i$ is the same as that in Example 3.29, $B_t$ is the set of $i$'s such that $m_i = t$, and $k_t$ is the number of $i$'s in $B_t$.
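The estimator in Example 3.30 amounts to grouping residual vectors by block size and averaging their outer products. A minimal sketch, assuming the residual vectors $r_i$ have already been computed from an LSE fit (the function name `block_cov_estimates` and the numbers are hypothetical):

```python
from collections import defaultdict

def block_cov_estimates(residual_blocks):
    """For each block size t, average r_i r_i' over the blocks i with m_i = t."""
    groups = defaultdict(list)
    for r in residual_blocks:
        groups[len(r)].append(r)           # B_t: blocks of size t
    estimates = {}
    for t, rs in groups.items():
        k_t = len(rs)                      # number of blocks of size t
        estimates[t] = [[sum(r[a] * r[b] for r in rs) / k_t for b in range(t)]
                        for a in range(t)]
    return estimates

# Hypothetical residuals: two blocks of size 2 and one block of size 1.
est = block_cov_estimates([[1.0, 2.0], [3.0, -1.0], [2.0]])
```

Consistency requires each $k_t$ to grow, so with only a handful of blocks per size, as in this toy input, the estimates are purely illustrative.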


Example 3.31. Suppose that $V$ is diagonal with the $i$th diagonal element $\sigma_i^2 = \psi(Z_i)$, where $\psi$ is an unknown function. The simplest case is $\psi(t) = \theta_0 + \theta_1v(t)$ for a known function $v$ and some unknown $\theta_0$ and $\theta_1$. One can then obtain a consistent estimator $\hat V$ by using the LSE of $\theta_0$ and $\theta_1$ under the "model"
\[ Er_i^2 = \theta_0 + \theta_1v(Z_i), \qquad i = 1,\ldots,n, \qquad (3.65) \]
where $r_i = X_i - Z_i^\tau\hat\beta$ (exercise). If $\psi$ is nonlinear or nonparametric, some results are given in Carroll (1982) and Müller and Stadtmüller (1987).
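The "model" (3.65) is an ordinary simple linear regression of squared residuals on $v(Z_i)$. A minimal sketch of that second-stage fit follows; the function name `fit_variance_function` and the inputs are hypothetical, and the values $v(Z_i)$ are assumed to be precomputed.

```python
import math

def fit_variance_function(v_vals, residuals):
    """LSE of (theta0, theta1) in E r_i^2 = theta0 + theta1 * v(Z_i)."""
    y = [r * r for r in residuals]                      # squared first-stage residuals
    n = len(y)
    vbar = sum(v_vals) / n
    ybar = sum(y) / n
    sxx = sum((v - vbar) ** 2 for v in v_vals)
    sxy = sum((v - vbar) * (yi - ybar) for v, yi in zip(v_vals, y))
    theta1 = sxy / sxx
    theta0 = ybar - theta1 * vbar
    return theta0, theta1

# Hypothetical inputs whose squared residuals follow 1 + 2 v exactly.
v_vals = [0.0, 1.0, 2.0, 3.0]
residuals = [math.sqrt(1.0 + 2.0 * v) for v in v_vals]
theta0_hat, theta1_hat = fit_variance_function(v_vals, residuals)

# Plug-in diagonal of V-hat.
v_hat_diag = [theta0_hat + theta1_hat * v for v in v_vals]
```

With real data the fitted values `theta0_hat + theta1_hat * v` can be nonpositive for some observations; how to handle that is not addressed in the text, so any truncation rule would be an additional assumption.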


Finally, if $\hat V$ is not consistent (i.e., (3.60) does not hold), then the weighted LSE $l^\tau\hat\beta_w$ can still be consistent and asymptotically normal, but its asymptotic variance is not $l^\tau(Z^\tau V^{-1}Z)^{-1}l$; in fact, $l^\tau\hat\beta_w$ may not be asymptotically as efficient as the LSE $l^\tau\hat\beta$ (Carroll and Cline, 1988; Chen and Shao, 1993). For example, if
\[ \|\hat V^{-1}U - I_n\|_{\max} \to_p 0, \]
where $U$ is positive definite, $0 < \inf_n\lambda_-[U] \le \sup_n\lambda_+[U] < \infty$, and $U \ne V$ (i.e., $\hat V$ is inconsistent for $V$), then, using the same argument as that in the proof of Theorem 3.17, we can show (exercise) that
\[ l^\tau(\hat\beta_w - \beta)/b_n \to_d N(0,1) \qquad (3.66) \]
for any $l \ne 0$, where $b_n^2 = l^\tau(Z^\tau U^{-1}Z)^{-1}Z^\tau U^{-1}VU^{-1}Z(Z^\tau U^{-1}Z)^{-1}l$. Hence, the asymptotic relative efficiency of the LSE $l^\tau\hat\beta$ w.r.t. $l^\tau\hat\beta_w$ can be less than 1 or larger than 1.
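Both possibilities can be seen in a scalar example. The sketch below assumes $p = 1$, $l = 1$, and diagonal $V$ and $U$ (simplifications not made in the text); the function names `sandwich_var` and `lse_var` and all numbers are hypothetical.

```python
def sandwich_var(z, v, u):
    """b_n^2 for p = 1, l = 1: (Z'U^{-1}Z)^{-1} Z'U^{-1}VU^{-1}Z (Z'U^{-1}Z)^{-1}, diagonal V and U."""
    bread = sum(zi * zi / ui for zi, ui in zip(z, u))
    meat = sum(zi * zi * vi / (ui * ui) for zi, vi, ui in zip(z, v, u))
    return meat / bread ** 2

def lse_var(z, v):
    """Variance of the ordinary LSE for p = 1 and V = diag(v)."""
    szz = sum(zi * zi for zi in z)
    return sum(zi * zi * vi for zi, vi in zip(z, v)) / szz ** 2

z = [1.0, 1.0, 1.0, 1.0]

# Case 1: errors are homoscedastic (LSE is BLUE) but the weights target a wrong U.
v_equal = [1.0, 1.0, 1.0, 1.0]
u_wrong = [1.0, 1.0, 100.0, 100.0]
are_case1 = sandwich_var(z, v_equal, u_wrong) / lse_var(z, v_equal)   # larger than 1

# Case 2: errors are strongly heteroscedastic and U is wrong but roughly right.
v_het = [1.0, 1.0, 100.0, 100.0]
u_close = [1.0, 1.0, 50.0, 50.0]
are_case2 = sandwich_var(z, v_het, u_close) / lse_var(z, v_het)       # smaller than 1
```

When `u` equals `v`, `sandwich_var` reduces to the reciprocal of the sum of `z_i**2 / v_i`, which is $a_n^2$ in Theorem 3.17.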


3.6 Exercises 


1. Let $X_1,\ldots,X_n$ be i.i.d. binary random variables with $P(X_i = 1) = p \in (0,1)$.
(a) Find the UMVUE of $p^m$, $m \le n$.
(b) Find the UMVUE of $P(X_1+\cdots+X_m = k)$, where $m$ and $k$ are positive integers $\le n$.
(c) Find the UMVUE of $P(X_1+\cdots+X_{n-1} > X_n)$.

2. Let $X_1,\ldots,X_n$ be i.i.d. having the $N(\mu,\sigma^2)$ distribution with an unknown $\mu \in \mathcal{R}$ and a known $\sigma^2 > 0$.
(a) Find the UMVUE's of $\mu^3$ and $\mu^4$.
(b) Find the UMVUE's of $P(X_1 \le t)$ and $\frac{d}{dt}P(X_1 \le t)$ with a fixed $t \in \mathcal{R}$.

3. In Example 3.4,
(a) show that the UMVUE of $\sigma^r$ is $k_{n-1,r}S^r$, where $r > 1-n$;
(b) prove that $(X_1 - \bar X)/S$ has the p.d.f. given by (3.1);
(c) show that $(X_1 - \bar X)/S \to_d N(0,1)$ by using (i) the SLLN and (ii) Scheffé's theorem (Proposition 1.18).


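4. Let $X_1,\ldots,X_m$ be i.i.d. having the $N(\mu_x,\sigma_x^2)$ distribution and let $Y_1,\ldots,Y_n$ be i.i.d. having the $N(\mu_y,\sigma_y^2)$ distribution. Assume that $X_i$'s and $Y_j$'s are independent.
(a) Assume that $\mu_x \in \mathcal{R}$, $\mu_y \in \mathcal{R}$, $\sigma_x^2 > 0$, and $\sigma_y^2 > 0$. Find the UMVUE's of $\mu_x - \mu_y$ and $(\sigma_x/\sigma_y)^r$, where $r > 0$ and $r < n$.
(b) Assume that $\mu_x \in \mathcal{R}$, $\mu_y \in \mathcal{R}$, and $\sigma_x^2 = \sigma_y^2 > 0$. Find the UMVUE's of $\sigma_x^2$ and $(\mu_x - \mu_y)/\sigma_x$.
(c) Assume that $\mu_x = \mu_y \in \mathcal{R}$, $\sigma_x^2 > 0$, $\sigma_y^2 > 0$, and $\sigma_x^2/\sigma_y^2 = \gamma$ is known. Find the UMVUE of $\mu_x$.
(d) Assume that $\mu_x = \mu_y \in \mathcal{R}$, $\sigma_x^2 > 0$, and $\sigma_y^2 > 0$. Show that a UMVUE of $\mu_x$ does not exist.
(e) Assume that $\mu_x \in \mathcal{R}$, $\mu_y \in \mathcal{R}$, $\sigma_x^2 > 0$, and $\sigma_y^2 > 0$. Find the UMVUE of $P(X_1 \le Y_1)$.
(f) Repeat (e) under the assumption that $\sigma_x = \sigma_y$.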


5. Let $X_1,\ldots,X_n$ be i.i.d. having the uniform distribution on the interval $(\theta_1 - \theta_2, \theta_1 + \theta_2)$, where $\theta_1 \in \mathcal{R}$, $\theta_2 > 0$, and $n > 2$. Find the UMVUE's of $\theta_j$, $j = 1,2$, and $\theta_1/\theta_2$.

6. Let $X_1,\ldots,X_n$ be i.i.d. having the exponential distribution $E(a,\theta)$ with parameters $\theta > 0$ and $a \in \mathcal{R}$.
(a) Find the UMVUE of $a$ when $\theta$ is known.
(b) Find the UMVUE of $\theta$ when $a$ is known.
(c) Find the UMVUE's of $\theta$ and $a$.
(d) Assume that $\theta$ is known. Find the UMVUE of $P(X_1 \ge t)$ and $\frac{d}{dt}P(X_1 \ge t)$ for a fixed $t > a$.
(e) Find the UMVUE of $P(X_1 \ge t)$ for a fixed $t > a$.

7. Let $X_1,\ldots,X_n$ be i.i.d. having the Pareto distribution $Pa(a,\theta)$ with $\theta > 0$ and $a > 0$.
(a) Find the UMVUE of $\theta$ when $a$ is known.
(b) Find the UMVUE of $a$ when $\theta$ is known.
(c) Find the UMVUE's of $a$ and $\theta$.


8. Consider Exercise 52(a) of §2.6. Find the UMVUE of ¥.


9. Let $X_1,\ldots,X_m$ be i.i.d. having the exponential distribution $E(a_x,\theta_x)$ with $\theta_x > 0$ and $a_x \in \mathcal{R}$ and $Y_1,\ldots,Y_n$ be i.i.d. having the exponential distribution $E(a_y,\theta_y)$ with $\theta_y > 0$ and $a_y \in \mathcal{R}$. Assume that $X_i$'s and $Y_j$'s are independent.
(a) Find the UMVUE's of $a_x - a_y$ and $\theta_x/\theta_y$.
(b) Suppose that $\theta_x = \theta_y$ but it is unknown. Find the UMVUE's of $\theta_x$ and $(a_x - a_y)/\theta_x$.
(c) Suppose that $a_x = a_y$ but it is unknown. Show that a UMVUE of $a_x$ does not exist.
(d) Suppose that $n = m$ and $a_x = a_y = 0$ and that our sample is $(Z_1,\Delta_1),\ldots,(Z_n,\Delta_n)$, where $Z_i = \min\{X_i,Y_i\}$ and $\Delta_i = 1$ if $X_i \ge Y_i$ and 0 otherwise, $i = 1,\ldots,n$. Find the UMVUE of $\theta_x - \theta_y$.

10. Let $X_1,\ldots,X_m$ be i.i.d. having the uniform distribution $U(0,\theta_x)$ and $Y_1,\ldots,Y_n$ be i.i.d. having the uniform distribution $U(0,\theta_y)$. Suppose that $X_i$'s and $Y_j$'s are independent and that $\theta_x > 0$ and $\theta_y > 0$. Find the UMVUE of $\theta_x/\theta_y$ when $n > 1$.


11. Let $X$ be a random variable having the negative binomial distribution $NB(p,r)$ with an unknown $p \in (0,1)$ and a known $r$.
(a) Find the UMVUE of $p^t$, $t < r$.
(b) Find the UMVUE of $\mathrm{Var}(X)$.
(c) Find the UMVUE of $\log p$.

12. Let $X_1,\ldots,X_n$ be i.i.d. random variables having the Poisson distribution $P(\theta)$ truncated at 0, i.e., $P(X_i = x) = (e^\theta - 1)^{-1}\theta^x/x!$, $x = 1,2,\ldots$, $\theta > 0$. Find the UMVUE of $\theta$ when $n = 1,2$.

13. Let $X$ be a random variable having the negative binomial distribution $NB(p,r)$ truncated at $r$, where $r$ is known and $p \in (0,1)$ is unknown. Let $k$ be a fixed positive integer $> r$. For $r = 1,2,3$, find the UMVUE of $p^k$.

14. Let $X_1,\ldots,X_n$ be i.i.d. having the log-distribution $L(p)$ with an unknown $p \in (0,1)$. Let $k$ be a fixed positive integer.
(a) For $n = 1,2,3$, find the UMVUE of $p^k$.
(b) For $n = 1,2,3$, find the UMVUE of $P(X = k)$.


15. Consider Exercise 43 of §2.6.

(a) Show that the estimator U = 2(|Xi| — +),x,40} is unbiased for 
6. 

(b) Derive the UMVUE of $\theta$.


16. Derive the UMVUE of $p$ in Exercise 33 of §2.6.


17. Derive the UMVUE's of @ and 4 in Exercise 55 of §2.6, based on data $X_1,\ldots,X_n$.


18. Suppose that $(X_0,X_1,\ldots,X_k)$ has the multinomial distribution in Example 2.7 with $p_j \in (0,1)$, $\sum_{j=0}^k p_j = 1$. Find the UMVUE of $p_0^{r_0}\cdots p_k^{r_k}$, where $r_j$'s are nonnegative integers with $r_0+\cdots+r_k \le n$.

19. Let $Y_1,\ldots,Y_n$ be i.i.d. from the uniform distribution $U(0,\theta)$ with an unknown $\theta \in (1,\infty)$.
(a) Suppose that we only observe
\[ X_i = \begin{cases} Y_i & \text{if } Y_i \ge 1 \\ 1 & \text{if } Y_i < 1, \end{cases} \qquad i = 1,\ldots,n. \]
Derive a UMVUE of $\theta$.
(b) Suppose that we only observe
\[ X_i = \begin{cases} Y_i & \text{if } Y_i \le 1 \\ 1 & \text{if } Y_i > 1, \end{cases} \qquad i = 1,\ldots,n. \]
Derive a UMVUE of the probability $P(Y_1 > 1)$.


20. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. random 2-vectors distributed as bivariate normal with $EX_i = EY_i = \beta z_i$, $\mathrm{Var}(X_i) = \mathrm{Var}(Y_i) = \sigma^2$, and $\mathrm{Cov}(X_i,Y_i) = \rho\sigma^2$, $i = 1,\ldots,n$, where $\beta \in \mathcal{R}$, $\sigma > 0$, and $\rho \in (-1,1)$ are unknown parameters, and $z_i$'s are known constants.
(a) Obtain a UMVUE of $\beta$ and calculate its variance.
(b) Obtain a UMVUE of $\sigma^2$ and calculate its variance.

21. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. random 2-vectors from a population $P \in \mathcal{P}$ that is the family of all bivariate populations with Lebesgue p.d.f.'s.
(a) Show that the set of $n$ pairs $(X_i,Y_i)$ ordered according to the value of their first coordinate constitutes a sufficient and complete statistic for $P \in \mathcal{P}$.
(b) A statistic $T$ is a function of the complete and sufficient statistic if and only if $T$ is invariant under permutation of the $n$ pairs.
(c) Show that $(n-1)^{-1}\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)$ is the UMVUE of $\mathrm{Cov}(X_1,Y_1)$.
(d) Find the UMVUE's of $P(X_i \le Y_i)$ and $P(X_i \le X_j \text{ and } Y_i \le Y_j)$, $i \ne j$.

22. Let $X_1,\ldots,X_n$ be i.i.d. from $P \in \mathcal{P}$ containing all symmetric c.d.f.'s with finite means and with Lebesgue p.d.f.'s on $\mathcal{R}$. Show that there is no UMVUE of $\mu = EX_1$ when $n > 1$.

23. Prove Corollary 3.1.

24. Suppose that $T$ is a UMVUE of an unknown parameter $\vartheta$. Show that $T^k$ is a UMVUE of $E(T^k)$, where $k$ is any positive integer for which $E(T^{2k}) < \infty$.


25. Consider the problem in Exercise 83 of §2.6. Use Theorem 3.2 to show that $I_{\{0\}}(X)$ is a UMVUE of $(1-p)^2$ and that there is no UMVUE of $p$.


26. Let $X_1,\ldots,X_n$ be i.i.d. from a discrete distribution with
\[ P(X_i = \theta - 1) = P(X_i = \theta) = P(X_i = \theta + 1) = \tfrac{1}{3}, \]
where $\theta$ is an unknown integer. Show that no nonconstant function of $\theta$ has a UMVUE.

27. Let $X$ be a random variable having the Lebesgue p.d.f.
\[ [(1-\theta) + \theta/(2\sqrt{x})]I_{(0,1)}(x), \]
where $\theta \in [0,1]$. Show that there is no UMVUE of $\theta$.


28. Let $X$ be a discrete random variable with $P(X = -1) = 2p(1-p)$ and $P(X = k) = p^k(1-p)^{3-k}$, $k = 0,1,2,3$, where $p \in (0,1)$.
(a) Determine whether there is a UMVUE of $p$.
(b) Determine whether there is a UMVUE of $p(1-p)$.

29. Let $X_1,\ldots,X_n$ be i.i.d. observations. Obtain a UMVUE of $a$ in the following cases.
(a) $X_1$ has the exponential distribution $E(a,\theta)$ with a known $\theta$ and an unknown $a \le 0$.
(b) $X_1$ has the Pareto distribution $Pa(a,\theta)$ with a known $\theta > 1$ and an unknown $a \in (0,1]$.

30. In Exercise 41 of §2.6, find a UMVUE of $\theta$ and show that it is unique a.s.

31. Prove Theorem 3.3 for the multivariate case ($k > 1$).


32. Let $X$ be a single sample from $P_\theta$. Find the Fisher information $I(\theta)$ in the following cases.
(a) $P_\theta$ is the $N(\mu,\sigma^2)$ distribution with $\theta = \mu \in \mathcal{R}$.
(b) $P_\theta$ is the $N(\mu,\sigma^2)$ distribution with $\theta = \sigma^2 > 0$.
(c) $P_\theta$ is the $N(\mu,\sigma^2)$ distribution with $\theta = \sigma > 0$.
(d) $P_\theta$ is the $N(\sigma,\sigma^2)$ distribution with $\theta = \sigma > 0$.
(e) $P_\theta$ is the $N(\mu,\sigma^2)$ distribution with $\theta = (\mu,\sigma^2) \in \mathcal{R}\times(0,\infty)$.
(f) $P_\theta$ is the negative binomial distribution $NB(\theta,r)$ with $\theta \in (0,1)$.
(g) $P_\theta$ is the gamma distribution $\Gamma(\alpha,\gamma)$ with $\theta = (\alpha,\gamma) \in (0,\infty)\times(0,\infty)$.
(h) $P_\theta$ is the beta distribution $B(\alpha,\beta)$ with $\theta = (\alpha,\beta) \in (0,1)\times(0,1)$.

33. Find a function of $\theta$ for which the amount of information is independent of $\theta$, when $P_\theta$ is
(a) the Poisson distribution $P(\theta)$ with $\theta > 0$;
(b) the binomial distribution $Bi(\theta,r)$ with $\theta \in (0,1)$;
(c) the gamma distribution $\Gamma(\alpha,\theta)$ with $\theta > 0$.

34. Prove the result in Example 3.9.


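35. Obtain the Fisher information matrix for a random variable with
(a) the Cauchy distribution $C(\mu,\sigma)$, $\mu \in \mathcal{R}$, $\sigma > 0$;
(b) the double exponential distribution $DE(\mu,\theta)$, $\mu \in \mathcal{R}$, $\theta > 0$;
(c) the logistic distribution $LG(\mu,\sigma)$, $\mu \in \mathcal{R}$, $\sigma > 0$;
(d) the c.d.f. $F_r\big(\frac{x-\mu}{\sigma}\big)$, where $F_r$ is the c.d.f. of the t-distribution $t_r$ with a known $r$, $\mu \in \mathcal{R}$, $\sigma > 0$;
(e) the Lebesgue p.d.f. $f_\theta(x) = (1-\epsilon)\phi(x-\mu) + \frac{\epsilon}{\sigma}\phi\big(\frac{x-\mu}{\sigma}\big)$, $\theta = (\mu,\sigma,\epsilon) \in \mathcal{R}\times(0,\infty)\times(0,1)$, where $\phi$ is the standard normal p.d.f.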


36. Let $X$ be a sample having a p.d.f. satisfying the conditions in Theorem 3.3, where $\theta$ is a $k$-vector of unknown parameters, and let $T(X)$ be a statistic. If $T$ has a p.d.f. $g_\theta$ satisfying the conditions in Theorem 3.3, then we define $I_T(\theta) = E\{\frac{\partial}{\partial\theta}\log g_\theta(T)[\frac{\partial}{\partial\theta}\log g_\theta(T)]^\tau\}$ to be the Fisher information about $\theta$ contained in $T$.
(a) Show that $I_X(\theta) - I_T(\theta)$ is nonnegative definite, where $I_X(\theta)$ is the Fisher information about $\theta$ contained in $X$.
(b) Show that $I_X(\theta) = I_T(\theta)$ if $T$ is sufficient for $\theta$.

37. Let $X_1,\ldots,X_n$ be i.i.d. from the uniform distribution $U(0,\theta)$ with $\theta > 0$.
(a) Show that condition (3.3) does not hold for $h(X) = X_{(n)}$.
(b) Show that the inequality in (3.6) does not hold for the UMVUE of $\theta$.

38. Prove Proposition 3.3.


39. Let $X$ be a single sample from the double exponential distribution $DE(\mu,\theta)$ with $\mu = 0$ and $\theta > 0$. Find the UMVUE's of the following parameters and, in each case, determine whether the variance of the UMVUE attains the Cramér-Rao lower bound.
(a) $\vartheta = \theta$;
(b) $\vartheta = \theta^r$, where $r > 1$;
(c) $\vartheta = (1+\theta)^{-1}$.

40. Let $X_1,\ldots,X_n$ be i.i.d. binary random variables with $P(X_i = 1) = p \in (0,1)$.
(a) Show that the UMVUE of $p(1-p)$ is $T_n = n\bar X(1-\bar X)/(n-1)$.
(b) Show that $\mathrm{Var}(T_n)$ does not attain the Cramér-Rao lower bound.
(c) Show that (3.10) holds.

41. Let $X_1,\ldots,X_n$ be i.i.d. having the Poisson distribution $P(\theta)$ with $\theta > 0$. Find the amse of the UMVUE of $e^{-t\theta}$ with a fixed $t > 0$ and show that (3.10) holds.

42. Let $X_1,\ldots,X_n$ be i.i.d. having the $N(\mu,\sigma^2)$ distribution with an unknown $\mu \in \mathcal{R}$ and a known $\sigma^2 > 0$.
(a) Find the UMVUE of $\vartheta = e^{t\mu}$ with a fixed $t \ne 0$.
(b) Determine whether the variance of the UMVUE in (a) attains the Cramér-Rao lower bound.
(c) Show that (3.10) holds.

43. Show that if $X_1,\ldots,X_n$ are i.i.d. binary random variables, $U_n$ in (3.12) equals $T(T-1)\cdots(T-m+1)/[n(n-1)\cdots(n-m+1)]$, where $T = \sum_{i=1}^n X_i$.


44. Show that if $T_n = \bar X$, then $U_n$ in (3.13) is the same as the sample variance $S^2$ in (2.2). Show that (3.23) holds for $T_n$ given by (3.22) with $E(R_n^2) = o(n^{-1})$.


45. Prove (3.14), (3.16), and (3.17).

46. Let $\zeta_k$ be given in Theorem 3.4. Show that $\zeta_1 \le \zeta_2 \le \cdots \le \zeta_m$.

47. Prove Corollary 3.2.

48. Prove (3.20) and show that $U_n - \check U_n$ is also a U-statistic.

49. Let $T_n$ be a symmetric statistic with $\mathrm{Var}(T_n) < \infty$ for every $n$ and $\check T_n$ be the projection of $T_n$ on $\binom{n}{k}$ random vectors $\{X_{i_1},\ldots,X_{i_k}\}$, $1 \le i_1 < \cdots < i_k \le n$. Show that $E(T_n) = E(\check T_n)$ and calculate $E(T_n - \check T_n)^2$.

50. Let $Y_k$ be defined in Lemma 3.2. Show that $\{Y_k\}$ is uniformly integrable.


51. Show that (3.22) with $E(R_n^2) = o(n^{-1})$ is satisfied for $T_n$ being a U-statistic with $E[h(X_1,\ldots,X_m)]^2 < \infty$.

52. Let $S^2$ be the sample variance given by (2.2), which is also a U-statistic (§3.2.1). Find the corresponding $h_1$, $h_2$, $\zeta_1$, and $\zeta_2$. Discuss how to apply Theorem 3.5 to this case.

53. Let $h(x_1,x_2,x_3) = I_{(-\infty,0)}(x_1+x_2+x_3)$. Define the U-statistic with this kernel and find $h_k$ and $\zeta_k$, $k = 1,2,3$.


54. Let $X_1,\ldots,X_n$ be i.i.d. random variables having finite $\mu = EX_1$ and $\bar\mu = EX_1^{-1}$. Find a U-statistic that is an unbiased estimator of $\mu\bar\mu$ and derive its variance and asymptotic distribution.


55. Show that $\hat\beta$ is an LSE of $\beta$ if and only if it is given by (3.29).

56. Obtain explicit forms for the LSE's of $\beta_j$, $j = 0,1$, and SSR, under the simple linear regression model in Example 3.11, assuming that some $t_i$'s are different.

57. Consider the polynomial model
\[ X_i = \beta_0 + \beta_1t_i + \beta_2t_i^2 + \varepsilon_i, \qquad i = 1,\ldots,n. \]
Find explicit forms for the LSE's of $\beta_j$, $j = 0,1,2$, and SSR, assuming that some $t_i$'s are different.


58. Suppose that
\[ X_{ij} = \alpha_i + \beta t_{ij} + \varepsilon_{ij}, \qquad i = 1,\ldots,a,\ j = 1,\ldots,b. \]
Find explicit forms for the LSE's of $\beta$, $\alpha_i$, $i = 1,\ldots,a$, and SSR.


59. Consider the polynomial model
\[ X_i = \beta_0 + \beta_1t_i + \beta_2t_i^2 + \beta_3t_i^3 + \varepsilon_i, \qquad i = 1,\ldots,n, \]
where $\varepsilon_i$'s are i.i.d. from $N(0,\sigma^2)$. Suppose that $n = 12$, $t_i = -1$, $i = 1,\ldots,4$, $t_i = 0$, $i = 5,\ldots,8$, and $t_i = 1$, $i = 9,\ldots,12$.
(a) Obtain the matrix $Z^\tau Z$ when this polynomial model is considered as a special case of model (3.24).
(b) Show whether the following parameters are estimable: $\beta_0 + \beta_2$, $\beta_1$, $\beta_0 - \beta_1$, $\beta_1 + \beta_3$, and $\beta_0 + \beta_1 + \beta_2 + \beta_3$.

60. Find the matrix $Z$, $Z^\tau Z$, and the form of $l \in \mathcal{R}(Z)$ under the one-way ANOVA model (3.31).

61. Obtain the matrix $Z$ under the two-way balanced ANOVA model (3.32). Show that the rank of $Z$ is $ab$. Verify the form of the LSE of $\beta$ given in Example 3.14. Find the form of $l \in \mathcal{R}(Z)$.

62. Consider the following model as a special case of model (3.25):
\[ X_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}, \qquad i = 1,\ldots,a,\ j = 1,\ldots,b,\ k = 1,\ldots,c. \]
Obtain the matrix $Z$, the parameter vector $\beta$, and the form of LSE's of $\beta$. Discuss conditions under which $l \in \mathcal{R}(Z)$.

63. Under model (3.25) and assumption A1, find the UMVUE's of $(l^\tau\beta)^2$, $l^\tau\beta/\sigma$, and $(l^\tau\beta/\sigma)^2$ for an estimable $l^\tau\beta$.

64. Verify the formulas for SSR's in Example 3.15.


65. Consider the one-way random effects model in Example 3.17. Assume that $n_i = n$ for all $i$ and that $A_i$'s and $e_{ij}$'s are normally distributed. Show that the family of populations is an exponential family with sufficient and complete statistics $\bar X_{\cdot\cdot}$, $S_A = n\sum_{i=1}^m(\bar X_{i\cdot} - \bar X_{\cdot\cdot})^2$, and $S_E = \sum_{i=1}^m\sum_{j=1}^n(X_{ij} - \bar X_{i\cdot})^2$. Find the UMVUE's of $\mu$, $\sigma_a^2$, and $\sigma^2$.


66. Consider model (3.25). Suppose that $\varepsilon_i$'s are i.i.d. with $E\varepsilon_i = 0$ and a Lebesgue p.d.f. $\sigma^{-1}f(x/\sigma)$, where $f$ is a known Lebesgue p.d.f. and $\sigma > 0$ is unknown.
(a) Show that $X$ is from a location-scale family given by (2.10).
(b) Find the Fisher information about $(\beta,\sigma)$ contained in $X_i$.
(c) Find the Fisher information about $(\beta,\sigma)$ contained in $X$.

67. Consider model (3.25) with assumption A2. Let $c \in \mathcal{R}^p$. Show that if the equation $c = Z^\tau y$ has a solution, then there is a unique solution $y_0 \in \mathcal{R}(Z^\tau)$ such that $\mathrm{Var}(y_0^\tau X) \le \mathrm{Var}(y^\tau X)$ for any other solution of $c = Z^\tau y$.


68. Consider model (3.25). Show that the number of independent linear functions of $X$ with mean 0 is $n - r$, where $r$ is the rank of $Z$.

69. Consider model (3.25) with assumption A2. Let $\hat X_i = Z_i^\tau\hat\beta$, which is called the least squares prediction of $X_i$. Let $h_{ij}$ be the $(i,j)$th element of $Z(Z^\tau Z)^-Z^\tau$ and $h_i = h_{ii}$. Show that
(a) $\mathrm{Var}(\hat X_i) = \sigma^2h_i$;
(b) $\mathrm{Var}(X_i - \hat X_i) = \sigma^2(1 - h_i)$;
(c) $\mathrm{Cov}(\hat X_i,\hat X_j) = \sigma^2h_{ij}$;
(d) $\mathrm{Cov}(X_i - \hat X_i, X_j - \hat X_j) = -\sigma^2h_{ij}$, $i \ne j$;
(e) $\mathrm{Cov}(\hat X_i, X_j - \hat X_j) = 0$.


70. Consider model (3.25) with assumption A2. Let $Z = (Z_1,Z_2)$ and $\beta = (\beta_1,\beta_2)$, where $Z_j$ is $n\times p_j$ and $\beta_j$ is a $p_j$-vector, $j = 1,2$. Assume that $(Z_1^\tau Z_1)^{-1}$ and $[Z_2^\tau Z_2 - Z_2^\tau Z_1(Z_1^\tau Z_1)^{-1}Z_1^\tau Z_2]^{-1}$ exist.
(a) Derive the LSE of $\beta$ in terms of $Z_1$, $Z_2$, and $X$.
(b) Let $\hat\beta = (\hat\beta_1,\hat\beta_2)$ be the LSE in (a). Calculate the covariance between $\hat\beta_1$ and $\hat\beta_2$.
(c) Suppose that it is known that $\beta_2 = 0$. Let $\tilde\beta_1$ be the LSE of $\beta_1$ under the reduced model $X = Z_1\beta_1 + \varepsilon$. Show that, for any $l \in \mathcal{R}^{p_1}$, $l^\tau\tilde\beta_1$ is better than $l^\tau\hat\beta_1$ in terms of their mse's.

71. Prove that (e) implies (b) in Theorem 3.10.

72. Show that (a) in Theorem 3.10 is equivalent to either
(f) $\mathrm{Var}(\varepsilon)Z = ZB$ for some matrix $B$, or
(g) $\mathcal{R}(Z^\tau)$ is generated by $r$ eigenvectors of $\mathrm{Var}(\varepsilon)$, where $r$ is the rank of $Z$.

73. Prove Corollary 3.3.

74. Suppose that
\[ X = \mu J_n + H\xi + e, \]
where $\mu \in \mathcal{R}$ is an unknown parameter, $J_n$ is the $n$-vector of 1's, $H$ is an $n\times p$ known matrix of full rank, $\xi$ is a random $p$-vector with $E(\xi) = 0$ and $\mathrm{Var}(\xi) = \sigma_\xi^2I_p$, $e$ is a random $n$-vector with $E(e) = 0$ and $\mathrm{Var}(e) = \sigma^2I_n$, and $\xi$ and $e$ are independent. Show that the LSE of $\mu$ is the BLUE if and only if the row totals of $HH^\tau$ are the same.


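75. Consider a special case of model (3.25):
\[ X_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \qquad i = 1,\ldots,a,\ j = 1,\ldots,b, \]
where $\mu$, $\alpha_i$'s, and $\beta_j$'s are unknown parameters, $E(\varepsilon_{ij}) = 0$, $\mathrm{Var}(\varepsilon_{ij}) = \sigma^2$, $\mathrm{Cov}(\varepsilon_{ij},\varepsilon_{i'j'}) = 0$ if $i \ne i'$, and $\mathrm{Cov}(\varepsilon_{ij},\varepsilon_{ij'}) = \sigma^2\rho$ if $j \ne j'$. Show that the LSE of $l^\tau\beta$ is the BLUE for any $l \in \mathcal{R}(Z)$.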


76. Consider model (3.25) under assumption A3 with $\mathrm{Var}(\varepsilon)$ = a block diagonal matrix whose $i$th block diagonal $V_i$ is $n_i\times n_i$ and has a single eigenvalue $\lambda_i$ with eigenvector $J_{n_i}$ (the $n_i$-vector of 1's) and a repeated eigenvalue $\rho_i$ with multiplicity $n_i - 1$, $i = 1,\ldots,k$, $\sum_{i=1}^k n_i = n$. Let $U$ be the $n\times k$ matrix whose $i$th column is $U_i$, where $U_1 = (J_{n_1}^\tau,0,\ldots,0)$, $U_2 = (0,J_{n_2}^\tau,\ldots,0)$, ..., $U_k = (0,0,\ldots,J_{n_k}^\tau)$.
(a) If $\mathcal{R}(Z^\tau) \subset \mathcal{R}(U^\tau)$ and $\lambda_i \equiv \lambda$, show that $l^\tau\hat\beta$ is the BLUE of $l^\tau\beta$ for any $l \in \mathcal{R}(Z)$.
(b) If $Z^\tau U_i = 0$ for all $i$ and $\rho_i \equiv \rho$, show that $l^\tau\hat\beta$ is the BLUE of $l^\tau\beta$ for any $l \in \mathcal{R}(Z)$.


77. Prove Proposition 3.4.

78. Show that the condition $\sup_n\lambda_+[\mathrm{Var}(\varepsilon)] < \infty$ is equivalent to the condition $\sup_i\mathrm{Var}(\varepsilon_i) < \infty$.

79. Find a condition under which the mse of $l^\tau\hat\beta$ is of the order $n^{-1}$. Apply it to problems in Exercises 56, 58, and 60-62.


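80. Consider model (3.25) with i.i.d. $\varepsilon_1,\ldots,\varepsilon_n$ having $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Let $\hat X_i = Z_i^\tau\hat\beta$ and $h_i = Z_i^\tau(Z^\tau Z)^-Z_i$.
(a) Show that for any $\epsilon > 0$,
\[ P(|\hat X_i - E\hat X_i| \ge \epsilon) \ge \min\{P(\varepsilon_i \ge \epsilon/h_i), P(\varepsilon_i \le -\epsilon/h_i)\}. \]
(Hint: for independent random variables $X$ and $Y$, $P(|X+Y| \ge \epsilon) \ge P(X \ge \epsilon)P(Y \ge 0) + P(X \le -\epsilon)P(Y < 0)$.)
(b) Show that $\hat X_i - E\hat X_i \to_p 0$ if and only if $h_i \to 0$.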


81. Prove Lemma 3.3 and show that condition (a) is implied by $\{\|Z_i\|\}$ being bounded and $\lambda_+[(Z^\tau Z)^-] \to 0$.

82. Consider the problem in Exercise 58. Suppose that $\{t_{ij}\}$ is bounded. Find a condition under which (3.39) holds.

83. Under the two-way ANOVA models in Example 3.14 and Exercise 62, find sufficient conditions for (3.39).

84. Consider the one-way random effects model in Example 3.17. Assume that $\{n_i\}$ is bounded and $E|e_{ij}|^{2+\delta} < \infty$ for some $\delta > 0$. Show that the LSE $\hat\mu$ of $\mu$ is asymptotically normal and derive an explicit form of $\mathrm{Var}(\hat\mu)$.


85. Suppose that
\[ X_i = \rho t_i + \varepsilon_i, \qquad i = 1,\ldots,n, \]
where $\rho \in \mathcal{R}$ is an unknown parameter, $t_i$'s are known and in $(a,b)$, $a$ and $b$ are known positive constants, and $\varepsilon_i$'s are independent random variables satisfying $E(\varepsilon_i) = 0$, $E|\varepsilon_i|^{2+\delta} < \infty$ for some $\delta > 0$, and $\mathrm{Var}(\varepsilon_i) = \sigma^2t_i$ with an unknown $\sigma^2 > 0$.
(a) Obtain the LSE of $\rho$.
(b) Obtain the BLUE of $\rho$.
(c) Show that both the LSE and BLUE are asymptotically normal and obtain the asymptotic relative efficiency of the BLUE w.r.t. the LSE.


86. In Example 3.19, show that $E(S^2) = \sigma^2$ given in (3.43).

87. Suppose that $X = (X_1,\ldots,X_n)$ is a simple random sample (without replacement) from a finite population $\mathcal{P} = \{y_1,\ldots,y_N\}$ with univariate $y_i$.
(a) Show that a necessary condition for $h(\theta)$ to be estimable is that $h$ is symmetric in its $N$ arguments.
(b) Find the UMVUE of $Y^m$, where $m$ is a fixed positive integer $< n$ and $Y$ is the population total.
(c) Find the UMVUE of $P(X_i \le X_j)$, $i \ne j$.
(d) Find the UMVUE of $\mathrm{Cov}(X_i,X_j)$, $i \ne j$.

88. Prove Theorem 3.14.

89. Under stratified simple random sampling described in §3.4.1, show that the vector of ordered values of all $X_{hi}$'s is neither sufficient nor complete for $\theta \in \Theta$.


90. Let $\mathcal{P} = \{y_1,\ldots,y_N\}$ be a population with univariate $y_i$. Define the population c.d.f. by $F(t) = N^{-1}\sum_{i=1}^N I_{(-\infty,t]}(y_i)$. Find the UMVUE of $F(t)$ under (a) simple random sampling and (b) stratified simple random sampling.

91. Consider the estimation of $F(t)$ in the previous exercise. Suppose that a sample of size $n$ is selected with $\pi_i > 0$. Find the Horvitz-Thompson estimator of $F(t)$. Is it a c.d.f.?

92. Show that $v_1$ in (3.49) and $v_2$ in (3.50) are unbiased estimators of $\mathrm{Var}(\hat Y_{ht})$. Prove that $v_1 = v_2$ under (a) simple random sampling and (b) stratified simple random sampling.

93. Consider the following two-stage stratified sampling plan. In the first stage, the population is stratified into $H$ strata and $k_h$ clusters are selected from stratum $h$ with probability proportional to cluster size, where sampling is independent across strata. In the second stage, a sample of $m_{hi}$ units is selected from sampled cluster $i$ in stratum $h$, and sampling is independent across clusters. Find $\pi_i$ and the Horvitz-Thompson estimator $\hat Y_{ht}$ of the population total.


94. In the previous exercise, prove the unbiasedness of $\hat Y_{ht}$ directly (without using Theorem 3.15).


95. Under systematic sampling, show that $\mathrm{Var}(\hat Y_{sy})$ is equal to


(1 a x)= ab Ess S- (visum - x) (verona = x) 


i=1 1<t<u<n 


96. In Exercise 91, discuss how to obtain a consistent (as $n \to N$) estimator $\hat F(t)$ of $F(t)$ such that $\hat F$ is a c.d.f.

97. Derive the $n^{-1}$ order asymptotic bias of the sample correlation coefficient defined in Exercise 22 in §2.6.


98. Derive the $n^{-1}$ order asymptotic bias and amse of $\hat t_\beta$ in Example 3.22, assuming that $\sum_{j=0}^{p-1}\beta_jt^j$ is convex in $t \in \mathcal{T}$.


99. Consider Example 3.23.
(a) Show that $\hat\theta$ is the BLUE of $\theta$.
(b) Show that $\hat\sigma^2$ is unbiased for $\sigma^2$.
(c) Show that $\hat\Sigma$ is consistent for $\Sigma$ as $k \to \infty$.
(d) Derive the amse of $\hat R(t)$ as $k \to \infty$.

100. Let $X_1,\ldots,X_n$ be i.i.d. from $N(\mu,\sigma^2)$, where $\mu \in \mathcal{R}$ and $\sigma^2 > 0$. Consider the estimation of $\vartheta = E\Phi(a + bX_1)$, where $\Phi$ is the standard normal c.d.f. and $a$ and $b$ are known constants. Obtain an explicit form of a function $g(\mu,\sigma^2) = \vartheta$ and the amse of $\hat\vartheta = g(\bar X,S^2)$.

101. Let $X_1,\ldots,X_n$ be i.i.d. with mean $\mu$, variance $\sigma^2$, and finite $\mu_j = EX_1^j$, $j = 2,3,4$. The sample coefficient of variation is defined to be $S/\bar X$, where $S$ is the square root of the sample variance $S^2$.
(a) If $\mu \ne 0$, show that $\sqrt{n}(S/\bar X - \sigma/\mu) \to_d N(0,\tau)$ and obtain an explicit formula of $\tau$ in terms of $\mu$, $\sigma^2$, and $\mu_j$.
(b) If $\mu = 0$, show that $n^{-1/2}S/\bar X \to_d [N(0,1)]^{-1}$.

102. Prove (3.52) and (3.53).


103. Let $X_1,\ldots,X_n$ be i.i.d. from $P$ in a parametric family. Obtain moment estimators of parameters in the following cases.
(a) $P$ is the gamma distribution $\Gamma(\alpha,\gamma)$, $\alpha > 0$, $\gamma > 0$.
(b) $P$ is the exponential distribution $E(a,\theta)$, $a \in \mathcal{R}$, $\theta > 0$.
(c) $P$ is the beta distribution $B(\alpha,\beta)$, $\alpha > 0$, $\beta > 0$.
(d) $P$ is the log-normal distribution $LN(\mu,\sigma^2)$, $\mu \in \mathcal{R}$, $\sigma > 0$.
(e) $P$ is the uniform distribution $U(\theta - \frac12, \theta + \frac12)$, $\theta \in \mathcal{R}$.
(f) $P$ is the negative binomial distribution $NB(p,r)$, $p \in (0,1)$, $r = 1,2,\ldots$.
(g) $P$ is the log-distribution $L(p)$, $p \in (0,1)$.
(h) $P$ is the log-normal distribution $LN(\mu,\sigma^2)$, $\mu \in \mathcal{R}$, $\sigma = 1$.
(i) $P$ is the chi-square distribution $\chi^2_k$ with an unknown $k = 1,2,\ldots$.


104. Obtain moment estimators of \ and p in Exercise 55 of §2.6, based on data $X_1,\ldots,X_n$.


105. Obtain the asymptotic distributions of the moment estimators in Exercise 103(a), (c), (e), and (g), and the asymptotic relative efficiencies of moment estimators w.r.t. UMVUE's in Exercise 103(b) and (h).

106. In Exercise 19(a), find a moment estimator of $\theta$ and derive its asymptotic distribution. In Exercise 19(b), obtain a moment estimator of $\theta^{-1}$ and its asymptotic relative efficiency w.r.t. the UMVUE of $\theta^{-1}$.

107. Let $X_1,\ldots,X_n$ be i.i.d. random variables having the Lebesgue p.d.f. $f_{\alpha,\beta}(x) = \alpha\beta^{-\alpha}x^{\alpha-1}I_{(0,\beta)}(x)$, where $\alpha > 0$ and $\beta > 0$ are unknown.
(a) Obtain moment estimators of $\alpha$ and $\beta$.
(b) Obtain the asymptotic distribution of the moment estimators of $\alpha$ and $\beta$ derived in (a).


108. Let $X_1,\ldots,X_n$ be i.i.d. from the following discrete distribution:
\[ P(X_1 = 1) = \frac{2(1-\theta)}{2-\theta}, \qquad P(X_1 = 2) = \frac{\theta}{2-\theta}, \]
where $\theta \in (0,1)$ is unknown.
(a) Obtain an estimator of $\theta$ using the method of moments.
(b) Obtain the amse of the moment estimator in (a).


109. Let $X_1,\ldots,X_n$ ($n > 1$) be i.i.d. from a population having the Lebesgue p.d.f.
\[ f_\theta(x) = (1-\epsilon)\phi(x-\mu) + \frac{\epsilon}{\sigma}\phi\Big(\frac{x-\mu}{\sigma}\Big), \]
where $\phi$ is the standard normal p.d.f., $\theta = (\mu,\sigma) \in \mathcal{R}\times(0,\infty)$ is unknown, and $\epsilon \in (0,1)$ is a known constant.
(a) Obtain an estimator of $\theta$ using the method of moments.
(b) Obtain the asymptotic distribution of the moment estimator in part (a).


110. Let $X_1,\ldots,X_n$ be i.i.d. random variables having the Lebesgue p.d.f.
\[ f_{\theta_1,\theta_2}(x) = \begin{cases} (\theta_1+\theta_2)^{-1}e^{-x/\theta_1} & x > 0 \\ (\theta_1+\theta_2)^{-1}e^{x/\theta_2} & x \le 0, \end{cases} \]
where $\theta_1 > 0$ and $\theta_2 > 0$ are unknown.
(a) Obtain an estimator of $(\theta_1,\theta_2)$ using the method of moments.
(b) Obtain the asymptotic distribution of the moment estimator in part (a).


111. (Nonexistence of a moment estimator). Consider $X_1,\ldots,X_n$ and the parametric family indexed by $(\theta,j) \in (0,1)\times\{1,2\}$ in Exercise 41 of §2.6. Let $h_i(\theta,j) = EX_1^i$, $i = 1,2$. Show that
\[ P(\hat\mu_i = h_i(\theta,j) \text{ has a solution}) \to 0 \]
as $n \to \infty$, when $X_i$'s are from the Poisson distribution $P(\theta)$.
112. In the proof of Proposition 3.5, show that E[W;,(Un — ¥)] = O(n~*).


113. Assume the conditions of Theorem 3.16.
(a) Prove (3.56).
(b) Show that $E[g_2(X_1,X_1)]/n = nEV_{n2} = m(m-1)\sum_{j=1}^\infty\lambda_j/(2n)$.


Let X1,..., Xy be i.i.d. with ac.d.f. F and U, and V,, be the U- and V- 
statistics with kernel [ [J(—20,y)(@1) — Fo(y)|[L(—co,y] (2) — Fo(y)]dFo, 
where Fo is a known c.d.f. 

(a) Obtain the asymptotic distributions of U,, and V, when F # Fo. 
(b) Obtain the asymptotic relative efficiency of U,, w.r.t. V, when 
B= Fo, 


115. Let $X_1, ..., X_n$ be i.i.d. with a c.d.f. $F$ having a finite sixth moment.
Consider the estimation of $\mu^3$, where $\mu = EX_1$. When $\mu = 0$, find
$\mathrm{amse}_{\bar X^3}(P)/\mathrm{amse}_{U_n}(P)$, where $U_n = \binom{n}{3}^{-1} \sum_{1 \le i < j < k \le n} X_i X_j X_k$.


116. Let $A_n$, $n = 1,2,...$, be a sequence of $k \times k$ matrices, where $k$ is a
fixed integer.

(a) Show that $\|A_n\|_{\max} \to 0$ if and only if $\|A_n\| \to 0$, where $\|A_n\|_{\max}$
is defined in (3.60) and $\|A_n\|^2 = \mathrm{tr}(A_n^\tau A_n)$.

(b) Show that if $A_n$'s are nonnegative definite, then $\|A_n\| \to 0$ if and
only if $\lambda_+[A_n] \to 0$, where $\lambda_+[A_n]$ is the largest eigenvalue of $A_n$.

(c) Show that the result in (a) is not always true if $k$ varies with $n$.


117. Prove that $\hat\sigma^2$ in (3.63) is unbiased and consistent for $\sigma^2$ under model
(3.25) with (3.62) and $\sup_i E|\varepsilon_i|^{2+\delta} < \infty$ for some $\delta > 0$. Under the
same conditions, show that $\hat\Sigma$ in (3.64) is consistent for $\Sigma$ in the sense
that $\|\hat\Sigma - \Sigma\|_{\max} \to_p 0$.


118. In Example 3.30, show that $\hat V_i$ is consistent for $V_i$ when $k_n \to \infty$ as
$n \to \infty$.


119. Show how to use equation (3.65) to obtain consistent estimators of 0
and A, ei 


120. Prove (3.66) under the assumed conditions in §3.5.4.


Chapter 4 


Estimation in Parametric Models


In this chapter, we consider point estimation methods in parametric models. 
One such method, the moment method, has been introduced in §3.5.2. It 
is assumed in this chapter that the sample X is from a population in a 
parametric family $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta \subset \mathcal{R}^k$ for a fixed integer $k \ge 1$.


4.1 Bayes Decisions and Estimators 


Bayes rules are introduced in §2.3.2 as decision rules minimizing the average 
risk w.r.t. a given probability measure $\Pi$ on $\Theta$. Bayes rules, however, are
optimal rules in the Bayesian approach, which is fundamentally different 
from the classical frequentist approach that we have been adopting. 


4.1.1 Bayes actions 


In the Bayesian approach, $\theta$ is viewed as a realization of a random vector $\tilde\theta$
whose prior distribution is $\Pi$. The prior distribution is based on past
experience, past data, or a statistician's belief and thus may be very subjective.
A sample $X$ is drawn from $P_\theta = P_{x|\theta}$, which is viewed as the conditional
distribution of $X$ given $\tilde\theta = \theta$. The sample $X = x$ is then used to obtain an
updated prior distribution, which is called the posterior distribution and
can be derived as follows. By Theorem 1.7, the joint distribution of $X$ and
$\tilde\theta$ is a probability measure on $\mathcal{X} \times \Theta$ determined by

$$P(A \times B) = \int_B P_{x|\theta}(A)\, d\Pi(\theta), \qquad A \in \mathcal{B}_{\mathcal{X}},\ B \in \mathcal{B}_\Theta,$$

where $\mathcal{X}$ is the range of $X$. The posterior distribution of $\tilde\theta$, given $X = x$, is
the conditional distribution $P_{\theta|x}$, whose existence is guaranteed by Theorem
1.7 for almost all $x \in \mathcal{X}$. When $P_{x|\theta}$ has a p.d.f., the following result
provides a formula for the p.d.f. of the posterior distribution $P_{\theta|x}$.


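Theorem 4.1 (Bayes formula). Assume that $\mathcal{P} = \{P_{x|\theta} : \theta \in \Theta\}$ is
dominated by a $\sigma$-finite measure $\nu$ and $f_\theta(x) = \frac{dP_{x|\theta}}{d\nu}(x)$ is a Borel function
on $(\mathcal{X} \times \Theta, \sigma(\mathcal{B}_{\mathcal{X}} \times \mathcal{B}_\Theta))$. Let $\Pi$ be a prior distribution on $\Theta$. Suppose that
$m(x) = \int_\Theta f_\theta(x)\,d\Pi > 0$.
(i) The posterior distribution $P_{\theta|x} \ll \Pi$ and

$$\frac{dP_{\theta|x}}{d\Pi} = \frac{f_\theta(x)}{m(x)}.$$

(ii) If $\Pi \ll \lambda$ and $\frac{d\Pi}{d\lambda} = \pi(\theta)$ for a $\sigma$-finite measure $\lambda$, then

$$\frac{dP_{\theta|x}}{d\lambda} = \frac{f_\theta(x)\pi(\theta)}{m(x)}. \tag{4.1}$$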


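Proof. Result (ii) follows from result (i) and Proposition 1.7(iii). To show
(i), we first show that $m(x) < \infty$ a.e. $\nu$. Note that

$$\int_{\mathcal{X}} m(x)\,d\nu = \int_{\mathcal{X}} \int_\Theta f_\theta(x)\,d\Pi\,d\nu = \int_\Theta \int_{\mathcal{X}} f_\theta(x)\,d\nu\,d\Pi = 1, \tag{4.2}$$

where the second equality follows from Fubini's theorem. Thus, $m(x)$ is
integrable w.r.t. $\nu$ and $m(x) < \infty$ a.e. $\nu$.

For $x \in \mathcal{X}$ with $m(x) < \infty$, define

$$P(B, x) = \frac{1}{m(x)} \int_B f_\theta(x)\,d\Pi, \qquad B \in \mathcal{B}_\Theta.$$

Then $P(\cdot, x)$ is a probability measure on $\Theta$ a.e. $\nu$. By Theorem 1.7, it
remains to show that

$$P(B, x) = P(\tilde\theta \in B \mid X = x).$$

By Fubini's theorem, $P(B, \cdot)$ is a measurable function of $x$. Let $P_{x,\theta}$ denote
the "joint" distribution of $(X, \tilde\theta)$. For any $A \in \sigma(X)$,

$$\begin{aligned}
\int_{A \times \Theta} I_B(\tilde\theta)\,dP_{x,\theta} &= \int_A \int_B f_\theta(x)\,d\Pi\,d\nu \\
&= \int_A \left[ \int_B \frac{f_\theta(x)}{m(x)}\,d\Pi \right] \left[ \int_\Theta f_\theta(x)\,d\Pi \right] d\nu \\
&= \int_\Theta \int_A \left[ \int_B \frac{f_\theta(x)}{m(x)}\,d\Pi \right] f_\theta(x)\,d\nu\,d\Pi \\
&= \int_{A \times \Theta} P(B, x)\,dP_{x,\theta},
\end{aligned}$$

where the third equality follows from Fubini's theorem. This completes the
proof. ∎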


Because of (4.2), $m(x)$ is called the marginal p.d.f. of $X$ w.r.t. $\nu$. If
$m(x) = 0$ for an $x \in \mathcal{X}$, then $f_\theta(x) = 0$ a.s. $\Pi$. Thus, either $x$ should be
eliminated from $\mathcal{X}$ or the prior $\Pi$ is incorrect and a new prior should be
specified. Therefore, without loss of generality we may assume that the
assumption of $m(x) > 0$ in Theorem 4.1 is always satisfied.


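If both $X$ and $\tilde\theta$ are discrete and $\nu$ and $\lambda$ are the counting measures,
then (4.1) becomes

$$P(\tilde\theta = \theta \mid X = x) = \frac{P(X = x \mid \tilde\theta = \theta)\,P(\tilde\theta = \theta)}{\sum_{\theta \in \Theta} P(X = x \mid \tilde\theta = \theta)\,P(\tilde\theta = \theta)},$$

which is the Bayes formula that appears in elementary probability.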
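The discrete form of the Bayes formula can be sketched in a few lines of Python. The two-point coin example, the dictionary layout, and the function name `discrete_posterior` are illustrative assumptions and are not taken from the text; exact fractions are used so that the normalization by $m(x)$ is visible.

```python
from fractions import Fraction

def discrete_posterior(prior, likelihood, x):
    """Posterior P(theta | X = x) on a finite parameter set.

    prior: dict theta -> P(theta)
    likelihood: dict theta -> dict x -> P(X = x | theta)
    """
    # numerator of the Bayes formula for each theta
    joint = {t: likelihood[t].get(x, 0) * p for t, p in prior.items()}
    m_x = sum(joint.values())  # marginal m(x), assumed positive
    if m_x == 0:
        raise ValueError("m(x) = 0: x is impossible under this prior")
    return {t: v / m_x for t, v in joint.items()}

# Hypothetical example: a coin is either fair or biased toward heads.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
likelihood = {
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(3, 4), "T": Fraction(1, 4)},
}
posterior = discrete_posterior(prior, likelihood, "H")
print(posterior)
```

The `ValueError` branch mirrors the remark above that an $x$ with $m(x) = 0$ must either be removed from the sample space or signals a misspecified prior.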


In the Bayesian approach, the posterior distribution $P_{\theta|x}$ contains all
the information we have about $\theta$ and, therefore, statistical decisions and
inference should be made based on $P_{\theta|x}$, conditional on the observed $X = x$.
In the problem of estimating $\theta$, $P_{\theta|x}$ can be viewed as a randomized decision
rule under the approach discussed in §2.3.


Definition 4.1. Let $\mathbb{A}$ be an action space in a decision problem and
$L(\theta, a) \ge 0$ be a loss function. For any $x \in \mathcal{X}$, a Bayes action w.r.t. $\Pi$
is any $\delta(x) \in \mathbb{A}$ such that

$$E[L(\tilde\theta, \delta(x)) \mid X = x] = \min_{a \in \mathbb{A}} E[L(\tilde\theta, a) \mid X = x], \tag{4.3}$$

where the expectation is w.r.t. the posterior distribution $P_{\theta|x}$.


The existence and uniqueness of Bayes actions can be discussed under 
some conditions on the loss function and the action space. 


Proposition 4.1. Assume that the conditions in Theorem 4.1 hold; $L(\theta, a)$
is convex in $a$ for each fixed $\theta$; and for each $x \in \mathcal{X}$, $E[L(\tilde\theta, a) \mid X = x] < \infty$
for some $a$.

(i) If $\mathbb{A}$ is a compact subset of $\mathcal{R}^p$ for some integer $p \ge 1$, then a Bayes
action $\delta(x)$ exists for each $x \in \mathcal{X}$.

(ii) If $\mathbb{A} = \mathcal{R}^p$ and $L(\theta, a)$ tends to $\infty$ as $\|a\| \to \infty$ uniformly in $\theta \in \Theta_0 \subset \Theta$
with $\Pi(\Theta_0) > 0$, then a Bayes action $\delta(x)$ exists for each $x \in \mathcal{X}$.

(iii) In (i) or (ii), if $L(\theta, a)$ is strictly convex in $a$ for each fixed $\theta$, then the
Bayes action is unique.

Proof. The convexity of the loss function implies the convexity and
continuity of $E[L(\tilde\theta, a) \mid X = x]$ as a function of $a$ with any fixed $x$. Then, the
result in (i) follows from the fact that any continuous function on a compact
set attains its minimum. The result in (ii) follows from the fact that

$$\lim_{\|a\| \to \infty} E[L(\tilde\theta, a) \mid X = x] \ge \lim_{\|a\| \to \infty} \int_{\Theta_0} L(\theta, a)\,dP_{\theta|x} = \infty$$

under the assumed condition in (ii). Finally, the result in (iii) follows from
the fact that $E[L(\tilde\theta, a) \mid X = x]$ is strictly convex in $a$ for any fixed $x$ under
the assumed conditions. ∎


Other conditions on $L$ under which a Bayes action exists can be found,
for example, in Lehmann (1983, §1.6 and §4.1). 


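Example 4.1. Consider the estimation of $\vartheta = g(\theta)$ for some real-valued
function $g$ such that $\int_\Theta [g(\theta)]^2\,d\Pi < \infty$. Suppose that $\mathbb{A}$ = the range of $g(\theta)$
and $L(\theta, a) = [g(\theta) - a]^2$ (squared error loss). Using the same argument as
in Example 1.22, we obtain the Bayes action

$$\delta(x) = \frac{\int_\Theta g(\theta) f_\theta(x)\,d\Pi}{m(x)} = \frac{\int_\Theta g(\theta) f_\theta(x)\,d\Pi}{\int_\Theta f_\theta(x)\,d\Pi}, \tag{4.4}$$

which is the posterior expectation of $g(\tilde\theta)$, given $X = x$.

More specifically, let us consider the case where $g(\theta) = \theta^j$ for some
integer $j \ge 1$, $f_\theta(x) = e^{-\theta}\theta^x I_{\{0,1,2,...\}}(x)/x!$ (the Poisson distribution) with
$\theta > 0$, and $\Pi$ has a Lebesgue p.d.f. $\pi(\theta) = \theta^{\alpha-1} e^{-\theta/\gamma} I_{(0,\infty)}(\theta)/[\Gamma(\alpha)\gamma^\alpha]$
(the gamma distribution $\Gamma(\alpha, \gamma)$ with known $\alpha > 0$ and $\gamma > 0$). Then, for
$x = 0, 1, 2, ...$,

$$\frac{f_\theta(x)\pi(\theta)}{m(x)} = c(x)\,\theta^{x+\alpha-1} e^{-\theta(\gamma+1)/\gamma} I_{(0,\infty)}(\theta), \tag{4.5}$$

where $c(x)$ is some function of $x$. By using Theorem 4.1 and matching the
right-hand side of (4.5) with that of the p.d.f. of the gamma distribution,
we know that the posterior is the gamma distribution $\Gamma(x + \alpha, \gamma/(\gamma + 1))$.
Hence, without actually working out the integral $m(x)$, we know that
$c(x) = (1 + \gamma^{-1})^{x+\alpha}/\Gamma(x + \alpha)$. Then

$$\delta(x) = c(x) \int_0^\infty \theta^{j+x+\alpha-1} e^{-\theta(\gamma+1)/\gamma}\,d\theta.$$

Note that the integrand is proportional to the p.d.f. of the gamma
distribution $\Gamma(j + x + \alpha, \gamma/(\gamma + 1))$. Hence

$$\delta(x) = c(x)\,\Gamma(j + x + \alpha)/(1 + \gamma^{-1})^{j+x+\alpha} = (j + x + \alpha - 1) \cdots (x + \alpha)/(1 + \gamma^{-1})^j.$$

In particular, $\delta(x) = (x + \alpha)\gamma/(\gamma + 1)$ when $j = 1$. ∎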
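As a rough numerical check of Example 4.1, the sketch below compares the closed-form Bayes action with a direct evaluation of the ratio of integrals in (4.4). The function names, the midpoint rule, the integration range, and the test values of $x$, $j$, $\alpha$, $\gamma$ are all assumptions made for illustration; the midpoint rule is only reliable here when $x + \alpha \ge 1$.

```python
import math

def bayes_action_poisson_gamma(x, j, alpha, gamma):
    """Closed form (j+x+alpha-1)...(x+alpha) / (1 + 1/gamma)^j."""
    prod = 1.0
    for i in range(j):
        prod *= x + alpha + i
    return prod / (1.0 + 1.0 / gamma) ** j

def posterior_moment_numeric(x, j, alpha, gamma, upper=200.0, steps=200_000):
    """Ratio of integrals in (4.4) by the midpoint rule on (0, upper)."""
    h = upper / steps
    num = den = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        # unnormalized posterior density: theta^(x+alpha-1) * exp(-theta*(gamma+1)/gamma)
        w = math.exp((x + alpha - 1.0) * math.log(t) - t * (gamma + 1.0) / gamma)
        num += t ** j * w
        den += w
    return num / den

x, j, alpha, gamma = 3, 2, 2.0, 1.5  # hypothetical values
closed = bayes_action_poisson_gamma(x, j, alpha, gamma)
numeric = posterior_moment_numeric(x, j, alpha, gamma)
print(closed, numeric)
```

The normalizing constant $c(x)$ never has to be computed: it cancels in the ratio, which is the same shortcut the example uses.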




An interesting phenomenon in Example 4.1 is that the prior and the
posterior are in the same parametric family of distributions. Such a prior is
called a conjugate prior. Under a conjugate prior, Bayes actions often have
explicit forms (in $x$) when the loss function is simple. Whether a prior is
conjugate involves a pair of families; one is the family $\mathcal{P} = \{f_\theta : \theta \in \Theta\}$
and the other is the family from which $\Pi$ is chosen. Example 4.1 shows
that the Poisson family and the gamma family produce conjugate priors.
It can be shown (exercise) that many pairs of families in Table 1.1 (page
18) and Table 1.2 (pages 20-21) produce conjugate priors.


In general, numerical methods have to be used in evaluating the inte- 
grals in (4.4) or Bayes actions under general loss functions. Even under a 
conjugate prior, the integral in (4.4) involving a general g may not have an 
explicit form. More discussions on the computation of Bayes actions are 
given in §4.1.4. 


As an example of deriving a Bayes action in a general decision problem, 
we consider Example 2.21. 


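Example 4.2. Consider the decision problem in Example 2.21. Let $P_{\theta|x}$
be the posterior distribution of $\tilde\theta$, given $X = x$. In this problem, $\mathbb{A} =
\{a_1, a_2, a_3\}$, which is compact in $\mathcal{R}$. By Proposition 4.1, we know that there
is a Bayes action if the mean of $P_{\theta|x}$ is finite. Let $E_{\theta|x}$ be the expectation
w.r.t. $P_{\theta|x}$. Since $\mathbb{A}$ contains only three elements, a Bayes action can be
obtained by comparing

$$E_{\theta|x}[L(\tilde\theta, a_j)] = \begin{cases} c_1 & j = 1 \\ c_2 + c_3 E_{\theta|x}[\psi(\tilde\theta, t)] & j = 2 \\ c_3 E_{\theta|x}[\psi(\tilde\theta, 0)] & j = 3, \end{cases}$$

where $\psi(\theta, t) = (\theta - \theta_0 - t) I_{(\theta_0 + t, \infty)}(\theta)$. ∎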


The minimization problem (4.3) is the same as the minimization problem

$$\int_\Theta L(\theta, \delta(x)) f_\theta(x)\,d\Pi = \min_{a \in \mathbb{A}} \int_\Theta L(\theta, a) f_\theta(x)\,d\Pi. \tag{4.6}$$

The minimization problem (4.6) is still defined even if $\Pi$ is not a probability
measure but a $\sigma$-finite measure on $\Theta$, in which case $m(x)$ may not be finite.
If $\Pi(\Theta) \neq 1$, $\Pi$ is called an improper prior. A prior with $\Pi(\Theta) = 1$ is then
called a proper prior. An action $\delta(x)$ that satisfies (4.6) with an improper
prior is called a generalized Bayes action.


The following is a reason why we need to discuss improper priors and
generalized Bayes actions. In many cases, one has no past information
and has to choose a prior subjectively. In such cases, one would like to
select a noninformative prior that tries to treat all parameter values in $\Theta$
equitably. A noninformative prior is often improper. We only provide one
example here. For more detailed discussions of the use of improper priors,
see Jeffreys (1939, 1948, 1961), Box and Tiao (1973), and Berger (1985).


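Example 4.3. Suppose that $X = (X_1, ..., X_n)$ and $X_i$'s are i.i.d. from
$N(\mu, \sigma^2)$, where $\mu \in \Theta \subset \mathcal{R}$ is unknown and $\sigma^2$ is known. Consider the
estimation of $\vartheta = \mu$ under the squared error loss. If $\Theta = [a, b]$ with $-\infty <
a < b < \infty$, then a noninformative prior that treats all parameter values
equitably is the uniform distribution on $[a, b]$. If $\Theta = \mathcal{R}$, however, the
corresponding "uniform distribution" is the Lebesgue measure on $\mathcal{R}$, which
is an improper prior. If $\Pi$ is the Lebesgue measure on $\mathcal{R}$, then

$$(2\pi\sigma^2)^{-n/2} \int_{-\infty}^{\infty} \mu^2 \exp\left\{ -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} \right\} d\mu < \infty.$$

By differentiating $a$ in

$$(2\pi\sigma^2)^{-n/2} \int_{-\infty}^{\infty} (\mu - a)^2 \exp\left\{ -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} \right\} d\mu$$

and using the fact that $\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar x)^2 + n(\bar x - \mu)^2$, where
$\bar x$ is the sample mean of the observations $x_1, ..., x_n$, we obtain that

$$\delta(x) = \frac{\int_{-\infty}^{\infty} \mu \exp\{-n(\bar x - \mu)^2/(2\sigma^2)\}\,d\mu}{\int_{-\infty}^{\infty} \exp\{-n(\bar x - \mu)^2/(2\sigma^2)\}\,d\mu} = \bar x.$$

Thus, the sample mean is a generalized Bayes action under the squared
error loss. From Example 2.25 and Exercise 91 in §2.6, if $\Pi$ is $N(\mu_0, \sigma_0^2)$,
then the Bayes action is $\mu_*(x)$ in (2.25). Note that in this case $\bar x$ is a limit
of $\mu_*(x)$ as $\sigma_0^2 \to \infty$. ∎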
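The limiting statement at the end of Example 4.3 can be illustrated numerically. The sketch assumes that formula (2.25) has the usual normal-normal form $\mu_*(x) = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\bar x$ (the form that also appears in Example 4.5); the function name and the numerical values are hypothetical.

```python
def posterior_mean_normal(xbar, n, sigma2, mu0, sigma0_2):
    """Bayes action under the N(mu0, sigma0_2) prior, assuming the form of (2.25)."""
    denom = n * sigma0_2 + sigma2
    return (sigma2 / denom) * mu0 + (n * sigma0_2 / denom) * xbar

xbar, n, sigma2, mu0 = 2.0, 10, 1.0, 0.0  # hypothetical data summary
for sigma0_2 in (0.1, 1.0, 10.0, 1e3, 1e6):
    # the prior weight sigma2/denom shrinks to 0 as the prior variance grows
    print(sigma0_2, posterior_mean_normal(xbar, n, sigma2, mu0, sigma0_2))
```

As the prior variance grows, the printed values move from the midpoint of $\mu_0$ and $\bar x$ toward $\bar x$, the generalized Bayes action under the Lebesgue prior.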


4.1.2 Empirical and hierarchical Bayes methods 


A Bayes action depends on the chosen prior that may depend on some
parameters called hyperparameters. In §4.1.1, hyperparameters are assumed
to be known. If hyperparameters are unknown, one way to solve the
problem is to estimate them using data $x_1, ..., x_n$; the resulting Bayes action is
called an empirical Bayes action.


The simplest empirical Bayes method is to estimate prior parameters
by viewing $x = (x_1, ..., x_n)$ as a "sample" from the marginal distribution

$$P_{x|\xi}(A) = \int_\Theta P_{x|\theta}(A)\,d\Pi_{\theta|\xi}, \qquad A \in \mathcal{B}_{\mathcal{X}},$$

where $\Pi_{\theta|\xi}$ is a prior depending on an unknown vector $\xi$ of hyperparameters,
or from the marginal p.d.f. $m(x)$ in (4.2), if $P_{x|\theta}$ has a p.d.f. $f_\theta$. The method
of moments introduced in §3.5.2, for example, can be applied to estimate
$\xi$. We consider an example.


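Example 4.4. Let $X = (X_1, ..., X_n)$ and $X_i$'s be i.i.d. from $N(\mu, \sigma^2)$ with
an unknown $\mu \in \mathcal{R}$ and a known $\sigma^2$. Consider the prior $\Pi_{\mu|\xi} = N(\mu_0, \sigma_0^2)$
with $\xi = (\mu_0, \sigma_0^2)$. To obtain a moment estimate of $\xi$, we need to calculate

$$\int_{\mathcal{R}^n} x_1 m(x)\,dx \quad \text{and} \quad \int_{\mathcal{R}^n} x_1^2 m(x)\,dx,$$

where $x = (x_1, ..., x_n)$. These two integrals can be obtained without
calculating $m(x)$. Note that

$$\int_{\mathcal{R}^n} x_1 m(x)\,dx = \int_{\mathcal{R}} \int_{\mathcal{R}^n} x_1 f_\mu(x)\,dx\,d\Pi_{\mu|\xi} = \int_{\mathcal{R}} \mu\,d\Pi_{\mu|\xi} = \mu_0$$

and

$$\int_{\mathcal{R}^n} x_1^2 m(x)\,dx = \int_{\mathcal{R}} \int_{\mathcal{R}^n} x_1^2 f_\mu(x)\,dx\,d\Pi_{\mu|\xi} = \sigma^2 + \int_{\mathcal{R}} \mu^2\,d\Pi_{\mu|\xi} = \sigma^2 + \mu_0^2 + \sigma_0^2.$$

Thus, by viewing $x_1, ..., x_n$ as a sample from $m(x)$, we obtain the moment
estimates

$$\hat\mu_0 = \bar x \quad \text{and} \quad \hat\sigma_0^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 - \sigma^2,$$

where $\bar x$ is the sample mean of $x_i$'s. Replacing $\mu_0$ and $\sigma_0^2$ in formula (2.25)
(Example 2.25) by $\hat\mu_0$ and $\hat\sigma_0^2$, respectively, we find that the empirical Bayes
action under the squared error loss is simply the sample mean $\bar x$ (which is
a generalized Bayes action; see Example 4.3). ∎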
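A minimal sketch of the moment-based empirical Bayes computation in Example 4.4 follows. It assumes the normal-normal form of formula (2.25) shown in Example 4.5; the function name and the data are hypothetical. The data are chosen so that the moment estimate of $\sigma_0^2$ comes out negative, which shows that the plug-in action still reduces to $\bar x$ in that case.

```python
from statistics import fmean

def empirical_bayes_normal(xs, sigma2):
    """Moment estimates of (mu0, sigma0^2) and the plug-in Bayes action."""
    n = len(xs)
    xbar = fmean(xs)
    mu0_hat = xbar
    # moment estimate of the prior variance; nothing forces it to be >= 0
    sigma0_2_hat = sum((x - xbar) ** 2 for x in xs) / n - sigma2
    # plug-in version of (2.25); breaks down only if denom happens to be 0
    denom = n * sigma0_2_hat + sigma2
    action = (sigma2 / denom) * mu0_hat + (n * sigma0_2_hat / denom) * xbar
    return mu0_hat, sigma0_2_hat, action

xs = [1.2, 0.7, 2.9, 1.6, 3.1]  # hypothetical observations
mu0_hat, sigma0_2_hat, action = empirical_bayes_normal(xs, sigma2=1.0)
print(mu0_hat, sigma0_2_hat, action)
```

Because $\hat\mu_0 = \bar x$, the two weights multiply the same number and sum to one, so the action equals $\bar x$ whatever the value of $\hat\sigma_0^2$.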


Note that $\hat\sigma_0^2$ in Example 4.4 can be negative. Better empirical Bayes
methods can be found, for example, in Berger (1985, §4.5). The following
method, called the hierarchical Bayes method, is generally better than
empirical Bayes methods.

Instead of estimating hyperparameters, in the hierarchical Bayes
approach we put a prior on hyperparameters. Let $\Pi_{\theta|\xi}$ be a (first-stage) prior
with a hyperparameter vector $\xi$ and let $\Lambda$ be a prior on $\Xi$, the range of $\xi$.
Then the "marginal" prior for $\theta$ is defined by

$$\Pi(B) = \int_\Xi \Pi_{\theta|\xi}(B)\,d\Lambda(\xi), \qquad B \in \mathcal{B}_\Theta. \tag{4.7}$$

If the second-stage prior $\Lambda$ also depends on some unknown hyperparameters,
then one can go on to consider a third-stage prior. In most applications,
however, two-stage priors are sufficient, since misspecifying a second-stage
prior is much less serious than misspecifying a first-stage prior (Berger,
1985, §4.6). In addition, the second-stage prior can be chosen to be
noninformative (improper).

Bayes actions can be obtained in the same way as before using the prior
in (4.7). Thus, the hierarchical Bayes method is simply a Bayes method
with a hierarchical prior. Empirical Bayes methods, however, deviate from
the Bayes method since $x_1, ..., x_n$ are used to estimate hyperparameters.


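Suppose that $X$ has a p.d.f. $f_\theta(x)$ w.r.t. a $\sigma$-finite measure $\nu$ and $\Pi_{\theta|\xi}$
has a p.d.f. $\pi_{\theta|\xi}(\theta)$ w.r.t. a $\sigma$-finite measure $\kappa$. Then the prior $\Pi$ in (4.7)
has a p.d.f.

$$\pi(\theta) = \int_\Xi \pi_{\theta|\xi}(\theta)\,d\Lambda(\xi)$$

w.r.t. $\kappa$ and

$$m(x) = \int_\Theta \int_\Xi f_\theta(x)\,\pi_{\theta|\xi}(\theta)\,d\Lambda\,d\kappa.$$

Let $P_{\theta|x,\xi}$ be the posterior distribution of $\tilde\theta$ given $x$ and $\xi$ (or $\xi$ is assumed
known) and

$$m_{x|\xi}(x) = \int_\Theta f_\theta(x)\,\pi_{\theta|\xi}(\theta)\,d\kappa,$$

which is the marginal of $X$ given $\xi$ (or $\xi$ is assumed known). Then the
posterior distribution $P_{\theta|x}$ has a p.d.f.

$$\begin{aligned}
\frac{dP_{\theta|x}}{d\kappa} &= \frac{f_\theta(x)\pi(\theta)}{m(x)} \\
&= \int_\Xi \frac{f_\theta(x)\,\pi_{\theta|\xi}(\theta)}{m(x)}\,d\Lambda(\xi) \\
&= \int_\Xi \frac{f_\theta(x)\,\pi_{\theta|\xi}(\theta)}{m_{x|\xi}(x)} \cdot \frac{m_{x|\xi}(x)}{m(x)}\,d\Lambda(\xi) \\
&= \int_\Xi \frac{dP_{\theta|x,\xi}}{d\kappa}\,dP_{\xi|x},
\end{aligned}$$

where $P_{\xi|x}$ is the posterior distribution of $\xi$ given $x$. Thus, under the
estimation problem considered in Example 4.1, the (hierarchical) Bayes
action is

$$\delta(x) = \int_\Xi \delta(x, \xi)\,dP_{\xi|x}, \tag{4.8}$$

where $\delta(x, \xi)$ is the Bayes action when $\xi$ is known. A result similar to (4.8)
is given in Lemma 4.1.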


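Example 4.5. Consider Example 4.4 again. Suppose that one of the
parameters in the first-stage prior $N(\mu_0, \sigma_0^2)$, $\mu_0$, is unknown and $\sigma_0^2$ is
known. Let the second-stage prior for $\xi = \mu_0$ be the Lebesgue measure on
$\mathcal{R}$ (improper prior). From Example 2.25,

$$\delta(x, \xi) = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\xi + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar x.$$

To obtain the Bayes action $\delta(x)$, it suffices to calculate $E_{\xi|x}(\xi)$, where the
expectation is w.r.t. $P_{\xi|x}$. Note that the p.d.f. of $P_{\xi|x}$ is proportional to

$$\psi(\xi) = \int_{-\infty}^{\infty} \exp\left\{ -\frac{n(\bar x - \mu)^2}{2\sigma^2} - \frac{(\mu - \xi)^2}{2\sigma_0^2} \right\} d\mu.$$

Using the properties of normal distributions, one can show that

$$\begin{aligned}
\psi(\xi) &= C_1 \exp\left\{ \left( \frac{n}{2\sigma^2} + \frac{1}{2\sigma_0^2} \right)^{-1} \left( \frac{n\bar x}{2\sigma^2} + \frac{\xi}{2\sigma_0^2} \right)^2 - \frac{\xi^2}{2\sigma_0^2} \right\} \\
&= C_2 \exp\left\{ -\frac{n\xi^2}{2(n\sigma_0^2 + \sigma^2)} + \frac{n\bar x\,\xi}{n\sigma_0^2 + \sigma^2} \right\} \\
&= C_3 \exp\left\{ -\frac{n(\xi - \bar x)^2}{2(n\sigma_0^2 + \sigma^2)} \right\},
\end{aligned}$$

where $C_1$, $C_2$, and $C_3$ are quantities not depending on $\xi$. Hence $E_{\xi|x}(\xi) = \bar x$.
The (hierarchical) generalized Bayes action is then

$$\delta(x) = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,E_{\xi|x}(\xi) + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar x = \bar x. \quad ∎$$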
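The claim $E_{\xi|x}(\xi) = \bar x$ in Example 4.5 can be checked by brute force without using the closed form of $\psi$. The sketch below evaluates $\psi(\xi)$ by a midpoint rule over $\mu$ and then averages $\xi$ against it; the grid limits, step counts, function names, and numerical inputs are assumptions chosen for illustration.

```python
import math

def psi(xi, xbar, n, sigma2, sigma0_2, lo=-30.0, hi=30.0, steps=1200):
    """Unnormalized posterior density of xi: midpoint-rule integral over mu."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        mu = lo + (i + 0.5) * h
        total += math.exp(-n * (xbar - mu) ** 2 / (2.0 * sigma2)
                          - (mu - xi) ** 2 / (2.0 * sigma0_2))
    return total * h

def posterior_mean_xi(xbar, n, sigma2, sigma0_2, lo=-30.0, hi=30.0, steps=600):
    """E(xi | x) as a ratio of midpoint sums, so constants cancel."""
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps):
        xi = lo + (i + 0.5) * h
        w = psi(xi, xbar, n, sigma2, sigma0_2)
        num += xi * w
        den += w
    return num / den

est = posterior_mean_xi(xbar=1.5, n=5, sigma2=1.0, sigma0_2=2.0)  # hypothetical inputs
print(est)
```

The nested loops are deliberately naive; the point is only that the numerical posterior mean of $\xi$ agrees with $\bar x$, as the completed square $C_3 \exp\{\cdot\}$ predicts.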


4.1.3 Bayes rules and estimators 


The discussion in §4.1.1 and §4.1.2 is more general than point estimation
and adopts an approach that is different from the frequentist approach used
in the rest of this book. In the frequentist approach, if a Bayes action $\delta(x)$
is a measurable function of $x$, then $\delta(X)$ is a nonrandomized decision rule.
It can be shown (exercise) that $\delta(X)$ defined in Definition 4.1 (if it exists
for $X = x \in A$ with $\int_\Theta P_\theta(A)\,d\Pi = 1$) also minimizes the Bayes risk

$$r_T(\Pi) = \int_\Theta R_T(\theta)\,d\Pi$$

over all decision rules $T$ (randomized or nonrandomized), where $R_T(\theta)$ is
the risk function of $T$ defined in (2.22). Thus, $\delta(X)$ is a Bayes rule (§2.3.2).
In an estimation problem, a Bayes rule is called a Bayes estimator.
Generalized Bayes risks, generalized Bayes rules (or estimators), and
empirical Bayes rules (or estimators) can be defined similarly.
In view of the discussion in §2.3.2, even if we do not adopt the Bayesian
approach, the method described in §4.1.1 can be used as a way of generating
decision rules. In this section, we study a Bayes rule or estimator in terms
of its risk (and bias and consistency for a Bayes estimator).


Bayes rules are typically admissible since, if there is a rule better than 
a Bayes rule, then that rule has the same Bayes risk as the Bayes rule 
and, therefore, is itself a Bayes rule. This actually proves part (i) of the 
following result. The proof of the other parts of the following result is left 
as an exercise. 


Theorem 4.2. In a decision problem, let $\delta(X)$ be a Bayes rule w.r.t. a
prior $\Pi$.

(i) If $\delta(X)$ is a unique Bayes rule, then $\delta(X)$ is admissible.

(ii) If $\Theta$ is a countable set, the Bayes risk $r_\delta(\Pi) < \infty$, and $\Pi$ gives positive
probability to each $\theta \in \Theta$, then $\delta(X)$ is admissible.

(iii) Let $\Im$ be the class of decision rules having continuous risk functions. If
$\delta(X) \in \Im$, $r_\delta(\Pi) < \infty$, and $\Pi$ gives positive probability to any open subset
of $\Theta$, then $\delta(X)$ is $\Im$-admissible.


Generalized Bayes rules or estimators are not necessarily admissible. 
Many generalized Bayes rules are limits of Bayes rules (see Examples 4.3 
and 4.7). Limits of Bayes rules are often admissible (Farrell, 1968a,b). The 
following result shows a technique of proving admissibility using limits of 
generalized Bayes risks. 


Theorem 4.3. Suppose that $\Theta$ is an open set of $\mathcal{R}^k$. In a decision problem,
let $\Im$ be the class of decision rules having continuous risk functions. A
decision rule $T \in \Im$ is $\Im$-admissible if there exists a sequence $\{\Pi_j\}$ of
(possibly improper) priors such that (a) the generalized Bayes risks $r_T(\Pi_j)$
are finite for all $j$; (b) for any $\theta_0 \in \Theta$ and $\eta > 0$,

$$\lim_{j \to \infty} \frac{r_T(\Pi_j) - r_j^*(\Pi_j)}{\Pi_j(O_{\theta_0,\eta})} = 0,$$

where $r_j^*(\Pi_j) = \inf_{T \in \Im} r_T(\Pi_j)$ and $O_{\theta_0,\eta} = \{\theta \in \Theta : \|\theta - \theta_0\| < \eta\}$ with
$\Pi_j(O_{\theta_0,\eta}) < \infty$ for all $j$.

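Proof. Suppose that $T$ is not $\Im$-admissible. Then there exists $T_0 \in \Im$ such
that $R_{T_0}(\theta) \le R_T(\theta)$ for all $\theta$ and $R_{T_0}(\theta_0) < R_T(\theta_0)$ for a $\theta_0 \in \Theta$. From
the continuity of the risk functions, we conclude that $R_{T_0}(\theta) < R_T(\theta) - \epsilon$
for all $\theta \in O_{\theta_0,\eta}$ and some constants $\epsilon > 0$ and $\eta > 0$. Then, for any $j$,

$$\begin{aligned}
r_T(\Pi_j) - r_j^*(\Pi_j) &\ge r_T(\Pi_j) - r_{T_0}(\Pi_j) \\
&\ge \int_{O_{\theta_0,\eta}} [R_T(\theta) - R_{T_0}(\theta)]\,d\Pi_j(\theta) \\
&\ge \epsilon\,\Pi_j(O_{\theta_0,\eta}),
\end{aligned}$$

which contradicts condition (b). Hence, $T$ is $\Im$-admissible. ∎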




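Example 4.6. Consider Example 4.3 and the estimation of $\mu$ under the
squared error loss. From Theorem 2.1, the risk function of any decision rule
is continuous in $\mu$ if the risk is finite. We now apply Theorem 4.3 to show
that the sample mean $\bar X$ is admissible. Let $\Pi_j = N(0, j)$. Since $R_{\bar X}(\mu) =
\sigma^2/n$, $r_{\bar X}(\Pi_j) = \sigma^2/n$ for any $j$. Hence, condition (a) in Theorem 4.3 is
satisfied. From Example 2.25, the Bayes estimator w.r.t. $\Pi_j$ is $\delta_j(X) =
\frac{nj}{nj + \sigma^2} \bar X$ (see formula (2.25)). Thus,

$$R_{\delta_j}(\mu) = \frac{\sigma^2 n j^2 + \sigma^4 \mu^2}{(nj + \sigma^2)^2}$$

and

$$r_j^*(\Pi_j) = \int R_{\delta_j}\,d\Pi_j = \frac{\sigma^2 j}{nj + \sigma^2}.$$

For any $O_{\mu_0,\eta} = \{\mu : |\mu - \mu_0| < \eta\}$,

$$\Pi_j(O_{\mu_0,\eta}) = \Phi\left( \frac{\mu_0 + \eta}{\sqrt j} \right) - \Phi\left( \frac{\mu_0 - \eta}{\sqrt j} \right) = \frac{2\eta\,\Phi'(\xi_j)}{\sqrt j}$$

for some $\xi_j$ satisfying $(\mu_0 - \eta)/\sqrt j \le \xi_j \le (\mu_0 + \eta)/\sqrt j$, where $\Phi$ is the
standard normal c.d.f. and $\Phi'$ is its derivative. Since $\Phi'(\xi_j) \to \Phi'(0) =
(2\pi)^{-1/2}$,

$$\frac{r_{\bar X}(\Pi_j) - r_j^*(\Pi_j)}{\Pi_j(O_{\mu_0,\eta})} = \frac{\sigma^4 \sqrt j}{2\eta\,\Phi'(\xi_j)\,n(nj + \sigma^2)} \to 0$$

as $j \to \infty$. Thus, condition (b) in Theorem 4.3 is satisfied and, hence, the
sample mean $\bar X$ is admissible. ∎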
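Condition (b) of Theorem 4.3 in Example 4.6 can be inspected numerically. The sketch below computes the ratio of the Bayes risk gap $\sigma^2/n - \sigma^2 j/(nj + \sigma^2)$ to the prior mass of $O_{\mu_0,\eta}$ under $N(0, j)$ for increasing $j$; the function names and the chosen values of $n$, $\sigma^2$, $\mu_0$, $\eta$ are illustrative assumptions.

```python
import math

def std_normal_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def admissibility_ratio(j, n, sigma2, mu0, eta):
    """[r_Xbar(Pi_j) - r_j^*(Pi_j)] / Pi_j(O_{mu0,eta}) for Pi_j = N(0, j)."""
    risk_gap = sigma2 / n - sigma2 * j / (n * j + sigma2)
    sd = math.sqrt(j)
    prior_mass = std_normal_cdf((mu0 + eta) / sd) - std_normal_cdf((mu0 - eta) / sd)
    return risk_gap / prior_mass

ratios = [admissibility_ratio(j, n=5, sigma2=1.0, mu0=0.0, eta=0.5)
          for j in (10, 1_000, 100_000)]
print(ratios)
```

The risk gap decays like $1/j$ while the prior mass of the neighborhood decays only like $1/\sqrt j$, so the ratio shrinks at rate $1/\sqrt j$.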


More results in admissibility can be found in §4.2 and §4.3.
The following result concerns the bias of a Bayes estimator. 


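Proposition 4.2. Let $\delta(X)$ be a Bayes estimator of $\vartheta = g(\theta)$ under
the squared error loss. Then $\delta(X)$ is not unbiased unless the Bayes risk
$r_\delta(\Pi) = 0$.

Proof. Suppose that $\delta(X)$ is unbiased, i.e., $E[\delta(X) \mid \tilde\theta] = g(\tilde\theta)$.
Conditioning on $\tilde\theta$ and using Proposition 1.10, we obtain that

$$E[g(\tilde\theta)\delta(X)] = E\{g(\tilde\theta)E[\delta(X) \mid \tilde\theta]\} = E[g(\tilde\theta)]^2.$$

Since $\delta(X) = E[g(\tilde\theta) \mid X]$, conditioning on $X$ and using Proposition 1.10, we
obtain that

$$E[g(\tilde\theta)\delta(X)] = E\{\delta(X)E[g(\tilde\theta) \mid X]\} = E[\delta(X)]^2.$$

Then

$$r_\delta(\Pi) = E[\delta(X) - g(\tilde\theta)]^2 = E[\delta(X)]^2 + E[g(\tilde\theta)]^2 - 2E[g(\tilde\theta)\delta(X)] = 0. \quad ∎$$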




Since $r_\delta(\Pi) = 0$ occurs usually in some trivial cases, a Bayes estimator
is typically not unbiased. Hence, Proposition 4.2 can be used to check 
whether an estimator can be a Bayes estimator w.r.t. some prior under 
the squared error loss. However, a generalized Bayes estimator may be 
unbiased; see, for instance, Examples 4.3 and 4.7. 


Bayes estimators are usually consistent and approximately unbiased. In 
a particular problem, it is usually easy to check directly whether Bayes 
estimators are consistent and approximately unbiased (Examples 4.7-4.9), 
especially when Bayes estimators have explicit forms. Bayes estimators also 
have some other good asymptotic properties, which are studied in §4.5.3. 


Let us consider some examples. 


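Example 4.7. Let $X = (X_1, ..., X_n)$ and $X_i$'s be i.i.d. from the exponential
distribution $E(0, \theta)$ with an unknown $\theta > 0$. Let the prior be such that $\theta^{-1}$
has the gamma distribution $\Gamma(\alpha, \gamma)$ with known $\alpha > 0$ and $\gamma > 0$. Then
the posterior of $\omega = \theta^{-1}$ is the gamma distribution $\Gamma(n + \alpha, (n\bar X + \gamma^{-1})^{-1})$
(verify), where $\bar X$ is the sample mean.

Consider first the estimation of $\theta = \omega^{-1}$. The Bayes estimator of $\theta$
under the squared error loss is

$$\delta(X) = \frac{(n\bar X + \gamma^{-1})^{n+\alpha}}{\Gamma(n + \alpha)} \int_0^\infty \omega^{n+\alpha-2} e^{-(n\bar X + \gamma^{-1})\omega}\,d\omega = \frac{n\bar X + \gamma^{-1}}{n + \alpha - 1}.$$

The bias of $\delta(X)$ is

$$\frac{n\theta + \gamma^{-1}}{n + \alpha - 1} - \theta = \frac{\gamma^{-1} - (\alpha - 1)\theta}{n + \alpha - 1} = O\left( \frac{1}{n} \right).$$

It is also easy to see that $\delta(X)$ is consistent. The UMVUE of $\theta$ is $\bar X$.
Since $\mathrm{Var}(\bar X) = \theta^2/n$, $r_{\bar X}(\Pi) > 0$ for any $\Pi$ and, hence, $\bar X$ is not a Bayes
estimator. In this case, $\bar X$ is the generalized Bayes estimator w.r.t. the
improper prior $\frac{d\Pi}{d\omega} = I_{(0,\infty)}(\omega)$ and is a limit of Bayes estimators $\delta(X)$ as
$\alpha \to 1$ and $\gamma \to \infty$ (exercise). The admissibility of $\delta(X)$ is considered in
Exercises 32 and 80.

Consider next the estimation of $e^{-t/\theta} = e^{-t\omega}$ (see Examples 2.26 and
3.3). The Bayes estimator under the squared error loss is

$$\delta_t(X) = \frac{(n\bar X + \gamma^{-1})^{n+\alpha}}{\Gamma(n + \alpha)} \int_0^\infty \omega^{n+\alpha-1} e^{-(n\bar X + \gamma^{-1} + t)\omega}\,d\omega = \left( 1 + \frac{t}{n\bar X + \gamma^{-1}} \right)^{-(n+\alpha)}.$$

Again, this estimator is biased and it is easy to show that $\delta_t(X)$ is consistent
as $n \to \infty$. In this case, the UMVUE given in Example 3.3 is neither a
Bayes estimator nor a limit of $\delta_t(X)$. ∎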
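The closed forms in Example 4.7 are easy to evaluate directly. The following sketch codes the two Bayes estimators and the bias formula as plain functions of $\bar X$; the function names and the numerical inputs are hypothetical, and the checks only illustrate the limiting behavior described in the example.

```python
import math

def bayes_theta(xbar, n, alpha, gamma):
    """Bayes estimator of theta: (n*xbar + 1/gamma) / (n + alpha - 1)."""
    return (n * xbar + 1.0 / gamma) / (n + alpha - 1.0)

def bias_theta(theta, n, alpha, gamma):
    """Bias of bayes_theta: (1/gamma - (alpha - 1)*theta) / (n + alpha - 1)."""
    return (1.0 / gamma - (alpha - 1.0) * theta) / (n + alpha - 1.0)

def bayes_survival(xbar, n, alpha, gamma, t):
    """Bayes estimator of exp(-t/theta): (1 + t/(n*xbar + 1/gamma))^(-(n+alpha))."""
    return (1.0 + t / (n * xbar + 1.0 / gamma)) ** (-(n + alpha))

# alpha = 1 with a very large gamma approximates the improper prior, giving xbar
print(bayes_theta(2.0, 10, 1.0, 1e12))
# the bias is of order 1/n
print([bias_theta(2.0, n, 3.0, 0.5) for n in (10, 100, 1000)])
# for large n the second estimator approaches exp(-t/xbar)
print(bayes_survival(2.0, 10**6, 2.0, 1.0, 1.0), math.exp(-0.5))
```

Setting $\alpha = 1$ and letting $\gamma$ grow reproduces $\bar X$, in line with the statement that the UMVUE is a limit of the Bayes estimators $\delta(X)$.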




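Example 4.8. Let $X = (X_1, ..., X_n)$ and $X_i$'s be i.i.d. from $N(\mu, \sigma^2)$
with unknown $\mu \in \mathcal{R}$ and $\sigma^2 > 0$. Let the prior for $\omega = (2\sigma^2)^{-1}$ be the
gamma distribution $\Gamma(\alpha, \gamma)$ with known $\alpha$ and $\gamma$ and let the prior for $\mu$
be $N(\mu_0, \sigma_0^2/\omega)$ (conditional on $\omega$). Then the posterior p.d.f. of $(\mu, \omega)$ is
proportional to

$$\omega^{(n+1)/2+\alpha-1} \exp\left\{ -\left[ \gamma^{-1} + Y + n(\bar X - \mu)^2 + \frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \omega \right\},$$

where $Y = \sum_{i=1}^n (X_i - \bar X)^2$ and $\bar X$ is the sample mean. Note that

$$n(\bar X - \mu)^2 + \frac{(\mu - \mu_0)^2}{2\sigma_0^2} = \left( n + \frac{1}{2\sigma_0^2} \right) \mu^2 - 2\left( n\bar X + \frac{\mu_0}{2\sigma_0^2} \right) \mu + n\bar X^2 + \frac{\mu_0^2}{2\sigma_0^2}.$$

Hence, the posterior p.d.f. of $(\mu, \omega)$ is proportional to

$$\omega^{(n+1)/2+\alpha-1} \exp\left\{ -\left[ \gamma^{-1} + W + \left( n + \frac{1}{2\sigma_0^2} \right) (\mu - \zeta(X))^2 \right] \omega \right\},$$

where

$$\zeta(X) = \frac{n\bar X + \frac{\mu_0}{2\sigma_0^2}}{n + \frac{1}{2\sigma_0^2}} \quad \text{and} \quad W = Y + n\bar X^2 + \frac{\mu_0^2}{2\sigma_0^2} - \left( n + \frac{1}{2\sigma_0^2} \right) [\zeta(X)]^2.$$

Thus, the posterior of $\omega$ is the gamma distribution $\Gamma(n/2 + \alpha, (\gamma^{-1} + W)^{-1})$
and the posterior of $\mu$ (given $\omega$ and $X$) is $N(\zeta(X), [(2n + \sigma_0^{-2})\omega]^{-1})$. Under
the squared error loss, the Bayes estimator of $\mu$ is $\zeta(X)$ and the Bayes
estimator of $\sigma^2 = (2\omega)^{-1}$ is $(\gamma^{-1} + W)/(n + 2\alpha - 2)$, provided that $n + 2\alpha > 2$.
Apparently, these Bayes estimators are biased but the biases are of the order
$n^{-1}$; and they are consistent as $n \to \infty$. ∎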
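The estimators of Example 4.8 can be computed from the data in a few lines. The sketch below follows the formulas for $\zeta(X)$ and $W$ term by term; the function name and the small data set are hypothetical and chosen so that the arithmetic can be checked by hand.

```python
def normal_gamma_bayes(xs, mu0, sigma0_2, alpha, gamma):
    """Bayes estimators (zeta(X), sigma^2 estimate) under squared error loss."""
    n = len(xs)
    xbar = sum(xs) / n
    y = sum((x - xbar) ** 2 for x in xs)          # Y in the example
    prec = n + 1.0 / (2.0 * sigma0_2)             # n + 1/(2*sigma0^2)
    zeta = (n * xbar + mu0 / (2.0 * sigma0_2)) / prec
    w = y + n * xbar ** 2 + mu0 ** 2 / (2.0 * sigma0_2) - prec * zeta ** 2
    if n + 2.0 * alpha <= 2.0:
        raise ValueError("the estimator of sigma^2 requires n + 2*alpha > 2")
    sigma2_hat = (1.0 / gamma + w) / (n + 2.0 * alpha - 2.0)
    return zeta, sigma2_hat

# Hypothetical data with xbar = mu0 = 2, so zeta(X) = 2 and W = Y = 2.
zeta, sigma2_hat = normal_gamma_bayes([1.0, 2.0, 3.0], mu0=2.0, sigma0_2=1.0,
                                      alpha=1.0, gamma=1.0)
print(zeta, sigma2_hat)
```

When the prior mean $\mu_0$ coincides with $\bar X$, the shrinkage in $\zeta(X)$ has no effect and $W$ reduces to $Y$, which makes the output easy to verify.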


To consider the last example, we need the following useful lemma whose 
proof is similar to the proof of result (4.8). 


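Lemma 4.1. Suppose that $X$ has a p.d.f. $f_\theta(x)$ w.r.t. a $\sigma$-finite measure
$\nu$. Suppose that $\theta = (\theta_1, \theta_2)$, $\theta_j \in \Theta_j$, and that the prior has a p.d.f.

$$\pi(\theta) = \pi_{\theta_1|\theta_2}(\theta_1)\,\pi_{\theta_2}(\theta_2),$$

where $\pi_{\theta_2}(\theta_2)$ is a p.d.f. w.r.t. a $\sigma$-finite measure $\nu_2$ on $\Theta_2$ and for any
given $\theta_2$, $\pi_{\theta_1|\theta_2}(\theta_1)$ is a p.d.f. w.r.t. a $\sigma$-finite measure $\nu_1$ on $\Theta_1$. Suppose
further that if $\theta_2$ is given, the Bayes estimator of $h(\theta_1) = g(\theta_1, \theta_2)$ under
the squared error loss is $\delta(X, \theta_2)$. Then the Bayes estimator of $g(\theta_1, \theta_2)$
under the squared error loss is $\delta(X)$ with

$$\delta(x) = \int_{\Theta_2} \delta(x, \theta_2)\,p_{\theta_2|x}(\theta_2)\,d\nu_2,$$

where $p_{\theta_2|x}(\theta_2)$ is the posterior p.d.f. of $\theta_2$ given $X = x$. ∎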




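Example 4.9. Consider a linear model

$$X_{ij} = \beta^\tau Z_i + \varepsilon_{ij}, \qquad j = 1, ..., n_i,\ i = 1, ..., k,$$

where $\beta \in \mathcal{R}^p$ is unknown, $Z_i$'s are known vectors, $\varepsilon_{ij}$'s are independent,
and $\varepsilon_{ij}$ is $N(0, \sigma_i^2)$, $j = 1, ..., n_i$, $i = 1, ..., k$. Let $X$ be the sample vector
containing all $X_{ij}$'s. The parameter vector is then $\theta = (\beta, \omega)$, where $\omega =
(\omega_1, ..., \omega_k)$ and $\omega_i = (2\sigma_i^2)^{-1}$. Assume that the prior for $\theta$ has the Lebesgue
p.d.f.

$$c\,\pi(\beta) \prod_{i=1}^k \omega_i^\alpha e^{-\omega_i/\gamma}, \tag{4.9}$$

where $\alpha > 0$, $\gamma > 0$, and $c > 0$ are known constants and $\pi(\beta)$ is a known
Lebesgue p.d.f. on $\mathcal{R}^p$. The posterior p.d.f. of $\theta$ is then proportional to

$$h(X, \theta) = \pi(\beta) \prod_{i=1}^k \omega_i^{n_i/2+\alpha} e^{-[\gamma^{-1} + v_i(\beta)]\omega_i},$$

where $v_i(\beta) = \sum_{j=1}^{n_i} (X_{ij} - \beta^\tau Z_i)^2$. If $\beta$ is known, the Bayes estimator of
$\sigma_i^2$ under the squared error loss is

$$\frac{\int \frac{1}{2\omega_i} h(X, \theta)\,d\omega}{\int h(X, \theta)\,d\omega} = \frac{\gamma^{-1} + v_i(\beta)}{2\alpha + n_i}.$$

By Lemma 4.1, the Bayes estimator of $\sigma_i^2$ is

$$\hat\sigma_i^2 = \int \frac{\gamma^{-1} + v_i(\beta)}{2\alpha + n_i}\,f_{\beta|X}(\beta)\,d\beta, \tag{4.10}$$

where

$$\begin{aligned}
f_{\beta|X}(\beta) &\propto \int h(X, \theta)\,d\omega \\
&\propto \pi(\beta) \prod_{i=1}^k \int_0^\infty \omega_i^{\alpha+n_i/2} e^{-[\gamma^{-1} + v_i(\beta)]\omega_i}\,d\omega_i \\
&\propto \pi(\beta) \prod_{i=1}^k \left[ \gamma^{-1} + v_i(\beta) \right]^{-(\alpha+1+n_i/2)}
\end{aligned} \tag{4.11}$$

is the posterior p.d.f. of $\beta$. The Bayes estimator of $l^\tau\beta$ for any $l \in \mathcal{R}^p$ is
then the posterior mean of $l^\tau\beta$ w.r.t. the p.d.f. $f_{\beta|X}(\beta)$.

In this problem, Bayes estimators do not have explicit forms. A
numerical method (such as one of those in §4.1.4) has to be used to evaluate
Bayes estimators (see Example 4.10).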




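Let $\bar X_{i\cdot}$ and $S_i^2$ be the sample mean and variance of $X_{ij}$, $j = 1, ..., n_i$
($S_i^2$ is defined to be 0 if $n_i = 1$), and let $\sigma_0^2 = (2\alpha\gamma)^{-1}$ (the prior mean of
$\sigma_i^2$). Then the Bayes estimator $\hat\sigma_i^2$ in (4.10) can be written as

$$\frac{2\alpha}{2\alpha + n_i}\,\sigma_0^2 + \frac{n_i - 1}{2\alpha + n_i}\,S_i^2 + \frac{n_i}{2\alpha + n_i} \int (\bar X_{i\cdot} - \beta^\tau Z_i)^2 f_{\beta|X}(\beta)\,d\beta. \tag{4.12}$$

The Bayes estimator in (4.12) is a weighted average of prior information,
"within group" variation, and averaged squared "residuals".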
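The step from (4.10) to (4.12) rests on the identity $v_i(\beta) = (n_i - 1)S_i^2 + n_i(\bar X_{i\cdot} - \beta^\tau Z_i)^2$, applied inside the integral for each fixed $\beta$. The sketch below checks the resulting pointwise equality for one group; the function names, the scalar `fitted` standing in for $\beta^\tau Z_i$, and the data are illustrative assumptions.

```python
def inner_estimate(x_i, fitted, alpha, gamma):
    """(1/gamma + v_i(beta)) / (2*alpha + n_i), with fitted = beta' Z_i."""
    v = sum((x - fitted) ** 2 for x in x_i)
    return (1.0 / gamma + v) / (2.0 * alpha + len(x_i))

def weighted_average_form(x_i, fitted, alpha, gamma):
    """Integrand of (4.12): prior mean, within-group variance, squared residual."""
    n = len(x_i)
    xbar = sum(x_i) / n
    s2 = sum((x - xbar) ** 2 for x in x_i) / (n - 1) if n > 1 else 0.0
    sigma0_2 = 1.0 / (2.0 * alpha * gamma)  # prior mean of sigma_i^2
    d = 2.0 * alpha + n
    return ((2.0 * alpha / d) * sigma0_2
            + ((n - 1) / d) * s2
            + (n / d) * (xbar - fitted) ** 2)

x_i = [2.1, 1.4, 3.0, 2.5]  # hypothetical observations in group i
lhs = inner_estimate(x_i, fitted=2.0, alpha=1.5, gamma=0.8)
rhs = weighted_average_form(x_i, fitted=2.0, alpha=1.5, gamma=0.8)
print(lhs, rhs)
```

Integrating both sides against $f_{\beta|X}(\beta)$ then gives (4.12), since only the last term depends on $\beta$.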


If $n_i \to \infty$, then the first term in (4.12) converges to 0 and the second
term in (4.12) is consistent and approximately unbiased for $\sigma_i^2$. Hence,
the Bayes estimator $\hat\sigma_i^2$ is consistent and approximately unbiased for $\sigma_i^2$ if
the mean of the last term in (4.12) tends to 0, which is true under some
conditions (see, e.g., Exercise 36). It is easy to see that $\hat\sigma_i^2$ is consistent and
approximately unbiased for $\sigma_i^2$ w.r.t. the joint distribution of $(X, \tilde\theta)$, since
the mean of the last term in (4.12) w.r.t. the joint distribution of $(X, \tilde\theta)$ is
bounded by $\sigma_0^2/n_i$. ∎


4.1.4 Markov chain Monte Carlo 


As we discussed previously, Bayes actions or estimators have to be
computed numerically in many applications. Typically we need to compute an
integral of the form

$$E_p(g) = \int_\Theta g(\theta)\,p(\theta)\,d\nu$$

with some function $g$, where $p(\theta)$ is a p.d.f. w.r.t. a $\sigma$-finite measure $\nu$ on
$(\Theta, \mathcal{B}_\Theta)$ and $\Theta \subset \mathcal{R}^k$. For example, if $g$ is an indicator function of $A \in \mathcal{B}_\Theta$
and $p(\theta)$ is the posterior p.d.f. of $\theta$ given $X = x$, then $E_p(g)$ is the posterior
probability of $A$; under the squared error loss, $E_p(g)$ is the Bayes action
(4.4) if $p(\theta)$ is the posterior p.d.f.

There are many numerical methods for computing integrals E,(g); see, 
for example, §4.5.3 and Berger (1985, §4.9). In this section, we discuss 
the Markov chain Monte Carlo (MCMC) methods, which are powerful nu- 
merical methods not only for Bayesian computations, but also for general 
statistical computing (see, e.g., §4.4.1). 

We start with the simple Monte Carlo method, which can be viewed as a
special case of the MCMC. Suppose that we can generate i.i.d. $\theta^{(1)}, ..., \theta^{(m)}$
from a p.d.f. $h(\theta) > 0$ w.r.t. $\nu$. By the SLLN (Theorem 1.13(ii)), as $m \to \infty$,

$$\hat E_p(g) = \frac{1}{m} \sum_{j=1}^m \frac{g(\theta^{(j)})\,p(\theta^{(j)})}{h(\theta^{(j)})} \to_{a.s.} \int_\Theta \frac{g(\theta)p(\theta)}{h(\theta)}\,h(\theta)\,d\nu = E_p(g).$$

Hence $\hat E_p(g)$ can be used as a numerical approximation to $E_p(g)$. The
process of generating $\theta^{(j)}$ according to $h$ is called importance sampling and
$h(\theta)$ is called the importance function. More discussions on importance
sampling can be found, for example, in Berger (1985), Geweke (1989), Shao
(1989), and Tanner (1996). When $p(\theta)$ is intractable or complex, it is
often difficult to choose a function $h$ that is simple enough for importance
sampling and results in a fast convergence of $\hat E_p(g)$ as well.
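A minimal importance-sampling sketch is given below. The target $p$ is taken to be the standard normal p.d.f., $g(\theta) = \theta^2$ (so $E_p(g) = 1$), and the importance function $h$ is the $N(0, 4)$ p.d.f.; these choices, the function names, the seed, and the sample size are assumptions made only to illustrate the estimator $\hat E_p(g)$.

```python
import math
import random

def normal_pdf(x, mean=0.0, sd=1.0):
    """Normal p.d.f. with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_estimate(g, p, h_pdf, h_sampler, m, rng):
    """Average of g(theta) * p(theta) / h(theta) over m i.i.d. draws from h."""
    total = 0.0
    for _ in range(m):
        theta = h_sampler(rng)
        total += g(theta) * p(theta) / h_pdf(theta)
    return total / m

rng = random.Random(0)  # fixed seed so the run is reproducible
est = importance_estimate(
    g=lambda t: t * t,
    p=normal_pdf,                               # target p: N(0, 1)
    h_pdf=lambda t: normal_pdf(t, 0.0, 2.0),    # importance function h: N(0, 4)
    h_sampler=lambda r: r.gauss(0.0, 2.0),
    m=100_000,
    rng=rng,
)
print(est)
```

An importance function with heavier tails than $p$, as here, keeps the weights $p/h$ bounded; a lighter-tailed $h$ can make the variance of the estimator very large or infinite.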

The simple Monte Carlo method, however, may not work well when $k$,
the dimension of $\Theta$, is large. This is because, when $k$ is large, the
convergence of $\hat E_p(g)$ requires a very large $m$; generating a random vector from
a $k$-dimensional distribution is usually expensive, if not impossible. More
sophisticated MCMC methods are different from the simple Monte Carlo
in two aspects: generating random vectors can be done using distributions
whose dimensions are much lower than $k$; and $\theta^{(1)}, ..., \theta^{(m)}$ are not
independent, but form a Markov chain.

Let $\{Y^{(t)} : t = 0, 1, ...\}$ be a Markov chain (§1.4.4) taking values in
$\mathcal{Y} \subset \mathcal{R}^k$. $\{Y^{(t)}\}$ is homogeneous if and only if

$$P(Y^{(t+1)} \in A \mid Y^{(t)}) = P(Y^{(1)} \in A \mid Y^{(0)})$$

for any $t$. For a homogeneous Markov chain $\{Y^{(t)}\}$, define

$$P(y, A) = P(Y^{(1)} \in A \mid Y^{(0)} = y), \qquad y \in \mathcal{Y},\ A \in \mathcal{B}_{\mathcal{Y}},$$

which is called the transition kernel of the Markov chain. Note that $P(y, \cdot)$
is a probability measure for every $y \in \mathcal{Y}$; $P(\cdot, A)$ is a Borel function for every
$A \in \mathcal{B}_{\mathcal{Y}}$; and the distribution of a homogeneous Markov chain is determined
by $P(y, A)$ and the distribution of $Y^{(0)}$ (initial distribution). MCMC
approximates an integral of the form $\int_{\mathcal{Y}} g(y)p(y)\,d\nu$ by $m^{-1} \sum_{t=1}^m g(Y^{(t)})$ with
a Markov chain $\{Y^{(t)} : t = 0, 1, ...\}$. The basic justification of the MCMC
approximation is given in the following result.


Theorem 4.4. Let $p(y)$ be a p.d.f. on $\mathcal{Y}$ w.r.t. a $\sigma$-finite measure $\nu$ and $g$ be
a Borel function on $\mathcal{Y}$ with $\int_{\mathcal{Y}} |g(y)|p(y)\,d\nu < \infty$. Let $\{Y^{(t)} : t = 0, 1, ...\}$ be
a homogeneous Markov chain taking values on $\mathcal{Y} \subset \mathcal{R}^k$ with the transition
kernel $P(y, A)$. Then

$$\frac{1}{m} \sum_{t=1}^m g(Y^{(t)}) \to_{a.s.} \int_{\mathcal{Y}} g(y)p(y)\,d\nu \tag{4.13}$$

and, as $t \to \infty$,

$$P^t(y, A) = P(Y^{(t)} \in A \mid Y^{(0)} = y) \to_{a.s.} \int_A p(y)\,d\nu, \tag{4.14}$$

provided that
(a) the Markov chain is aperiodic in the sense that there does not exist $d \ge 2$
nonempty disjoint events $A_0, ..., A_{d-1}$ in $\mathcal{B}_{\mathcal{Y}}$ such that for all $i = 0, ..., d-1$
and all $y \in A_i$, $P(y, A_j) = 1$ for $j = i + 1$ (mod $d$);

(b) the Markov chain is $p$-invariant in the sense that $\int P(y, A)p(y)\,d\nu =
\int_A p(y)\,d\nu$ for all $A \in \mathcal{B}_{\mathcal{Y}}$;

(c) the Markov chain is $p$-irreducible in the sense that for any $y \in \mathcal{Y}$ and any
$A$ with $\int_A p(y)\,d\nu > 0$, there exists a positive integer $t$ such that $P^t(y, A)$
in (4.14) is positive; and

(d) the Markov chain is Harris recurrent in the sense that for any $A$ with
$\int_A p(y)\,d\nu > 0$, $P\left( \sum_{t=1}^\infty I_A(Y^{(t)}) = \infty \mid Y^{(0)} = y \right) = 1$ for all $y$. ∎


The proof of these results is beyond the scope of this book and, hence, is 
omitted. It can be found, for example, in Nummelin (1984), Chan (1993), 
and Tierney (1994). A homogeneous Markov chain satisfying conditions 
(a)-(d) in Theorem 4.4 is called ergodic with equilibrium distribution p. 
Result (4.13) means that the MCMC approximation is consistent and result 
(4.14) indicates that p is the limiting p.d.f. of the Markov chain. 


One of the key issues in MCMC is the choice of the kernel P(y, A). The 
first requirement on P(y, A) is that conditions (a)-(d) in Theorem 4.4 be 
satisfied. Condition (a) is usually easy to check for any given P(y, A). In the 
following, we consider two popular MCMC methods satisfying conditions
(a)-(d).


Gibbs sampler 


One way to construct a $p$-invariant homogeneous Markov chain is to use
conditioning. Suppose that $Y$ has the p.d.f. $p(y)$. Let $Y_i$ (or $y_i$) be the $i$th
component of $Y$ (or $y$) and let $Y_{-i}$ (or $y_{-i}$) be the $(k-1)$-vector containing
all components of $Y$ (or $y$) except $Y_i$ (or $y_i$). Then
$$P_i(y_{-i}, A) = P(Y \in A \mid Y_{-i} = y_{-i})$$
is a transition kernel for any $i$. The MCMC method using this kernel is
called the single-site Gibbs sampler. Note that
$$\int P_i(y_{-i}, A)p(y)\,d\nu = E[P(Y \in A \mid Y_{-i})] = P(Y \in A) = \int_A p(y)\,d\nu$$
and, therefore, the chain with kernel $P_i(y_{-i}, A)$ is $p$-invariant. However,
this chain is not $p$-irreducible since $P_i(y_{-i}, \cdot)$ puts all its mass on the set
$\psi_i^{-1}(y_{-i})$, where $\psi_i(y) = y_{-i}$. Gelfand and Smith (1990) considered a
systematic scan Gibbs sampler whose kernel $P(y, A)$ is a composite of $k$ kernels
$P_i(y_{-i}, A)$, $i = 1,...,k$. More precisely, the chain is defined as follows. Given


$Y^{(t-1)} = y^{(t-1)}$, we generate $y_1^{(t)}$ from $P_1(y_2^{(t-1)},...,y_k^{(t-1)}, \cdot)$, ..., $y_j^{(t)}$ from
$P_j(y_1^{(t)},...,y_{j-1}^{(t)}, y_{j+1}^{(t-1)},...,y_k^{(t-1)}, \cdot)$, ..., $y_k^{(t)}$ from $P_k(y_1^{(t)},...,y_{k-1}^{(t)}, \cdot)$. It can
be shown that this Markov chain is still $p$-invariant. We illustrate this with
the case of $k = 2$, taking $Y^{(0)}$ to have p.d.f. $p$. Note that $Y_1^{(1)}$ is generated
from $P_1(y_2^{(0)}, \cdot)$, the conditional distribution of $Y$ given $Y_2 = y_2^{(0)}$. Hence
$(Y_1^{(1)}, Y_2^{(0)})$ has p.d.f. $p$. Similarly, we can show that $Y^{(1)} = (Y_1^{(1)}, Y_2^{(1)})$ has
p.d.f. $p$. Thus,
$$\begin{aligned}
\int P(y, A)p(y)\,d\nu &= \int P(Y^{(1)} \in A \mid Y^{(0)} = y)p(y)\,d\nu \\
&= E[P(Y^{(1)} \in A \mid Y^{(0)})] \\
&= P(Y^{(1)} \in A) \\
&= \int_A p(y)\,d\nu.
\end{aligned}$$
This Markov chain is also $p$-irreducible and aperiodic if $p(y) > 0$ for all
$y \in \mathcal{Y}$; see, for example, Chan (1993). Finally, if $p(y) > 0$ for all $y \in \mathcal{Y}$,
then $P(y, \cdot) \ll$ the distribution with p.d.f. $p$ (absolute continuity) for all $y$
and, by Corollary 1 of Tierney (1994), the Markov chain is Harris recurrent.
Thus, Theorem 4.4 applies and (4.13) and (4.14) hold.


The previous Gibbs sampler can obviously be extended to the case where
$y_i$'s are subvectors (of possibly different dimensions) of $y$.
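The systematic scan described above can be sketched for $k = 2$ as follows. This is only an illustration under assumptions not made in the text: the target $p$ is taken to be a standard bivariate normal p.d.f. with correlation `rho`, whose full conditionals are $N(\rho y_2, 1-\rho^2)$ and $N(\rho y_1, 1-\rho^2)$; the function name, burn-in length, and seed are hypothetical choices.

```python
import numpy as np

def gibbs_bivariate_normal(rho, m, burn_in=500, seed=0):
    """Systematic-scan Gibbs sampler for a standard bivariate normal
    target with correlation rho (an assumed example target)."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    y1, y2 = 0.0, 0.0                  # arbitrary starting point Y^(0)
    draws = np.empty((m, 2))
    for t in range(burn_in + m):
        # y1^(t) from the conditional of Y1 given Y2 = y2^(t-1)
        y1 = rng.normal(rho * y2, sd)
        # y2^(t) from the conditional of Y2 given Y1 = y1^(t)
        y2 = rng.normal(rho * y1, sd)
        if t >= burn_in:
            draws[t - burn_in] = (y1, y2)
    return draws

draws = gibbs_bivariate_normal(rho=0.8, m=20000)
est_mean = draws.mean(axis=0)            # ergodic average, as in (4.13) with g(y) = y
est_corr = np.corrcoef(draws.T)[0, 1]    # should be close to rho
```

The ergodic averages play the role of $m^{-1}\sum_t g(Y^{(t)})$ in (4.13); discarding an initial burn-in is a common practical choice and is not required by Theorem 4.4.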


Let us now return to Bayesian computation and consider the following 
example. 


Example 4.10. Consider Example 4.9. Under the given prior for $\theta =
(\beta, \omega)$, it is difficult to generate random vectors directly from the posterior
p.d.f., given $X = x$ (which does not have a familiar form). To apply a
Gibbs sampler with $y = \theta$, $y_1 = \beta$, and $y_2 = \omega$, we need to generate random
vectors from the posterior of $\beta$, given $x$ and $\omega$, and the posterior of $\omega$, given
$x$ and $\beta$. From (4.9) and (4.11), the posterior of $\omega = (\omega_1,...,\omega_k)$, given $x$
and $\beta$, is a product of marginals of $\omega_i$'s that are the gamma distributions
$\Gamma(\alpha+1+n_i/2, [\gamma^{-1} + v_i(\beta)]^{-1})$, $i = 1,...,k$. Assume now that $\pi(\beta) \equiv 1$
(noninformative prior for $\beta$). It follows from (4.9) that the posterior p.d.f.
of $\beta$, given $x$ and $\omega$, is proportional to
$$\prod_{i=1}^k \omega_i^{n_i/2} e^{-\omega_i v_i(\beta)} \propto e^{-\|W^{1/2}Z\beta - W^{1/2}X\|^2},$$
where $W$ is the diagonal block matrix whose $i$th block on the diagonal
is $\omega_i I_{n_i}$. Let $n = \sum_{i=1}^k n_i$. Then, the posterior of $W^{1/2}Z\beta$, given $X$
and $\omega$, is $N_n(W^{1/2}X, 2^{-1}I_n)$ and the posterior of $\beta$, given $X$ and $\omega$, is
$N_p((Z^\tau WZ)^{-1}Z^\tau WX, 2^{-1}(Z^\tau WZ)^{-1})$ ($Z^\tau WZ$ is assumed of full rank for
simplicity), since $\beta = [(Z^\tau WZ)^{-1}Z^\tau W^{1/2}]W^{1/2}Z\beta$. Note that random
generation using these two posterior distributions is fairly easy. ∎




The Metropolis algorithm 


A large class of MCMC methods is obtained using the Metropolis
algorithm (Metropolis et al., 1953). We introduce Hastings' version of the
algorithm. Let $Q(y, A)$ be a transition kernel of a homogeneous Markov
chain satisfying
$$Q(y, A) = \int_A q(y, z)\,d\nu(z)$$
for a measurable function $q(y, z) \geq 0$ on $\mathcal{Y} \times \mathcal{Y}$ and a $\sigma$-finite measure $\nu$ on
$(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$. Without loss of generality, assume that $\int_{\mathcal{Y}} p(y)\,d\nu = 1$ and that $p$
is not concentrated on a single point. Define
$$\alpha(y, z) = \begin{cases} \min\left\{\dfrac{p(z)q(z, y)}{p(y)q(y, z)}, 1\right\} & p(y)q(y, z) > 0 \\ 1 & p(y)q(y, z) = 0 \end{cases}$$
and
$$p(y, z) = \begin{cases} q(y, z)\alpha(y, z) & y \neq z \\ 0 & y = z. \end{cases}$$
The Metropolis kernel $P(y, A)$ is defined by
$$P(y, A) = \int_A p(y, z)\,d\nu(z) + r(y)\delta_y(A), \tag{4.15}$$
where $r(y) = 1 - \int p(y, z)\,d\nu(z)$ and $\delta_y$ is the point mass at $y$ defined in
(1.22). The corresponding Markov chain can be described as follows. If the
chain is currently at a point $Y^{(t)} = y$, then it generates a candidate value
$z$ for the next location $Y^{(t+1)}$ from $Q(y, \cdot)$. With probability $\alpha(y, z)$, the
chain moves to $Y^{(t+1)} = z$. Otherwise, the chain remains at $Y^{(t+1)} = y$.

Note that this algorithm only depends on $p(y)$ through $p(y)/p(z)$. Thus,
it can be used when $p(y)$ is known up to a normalizing constant, which often
occurs in Bayesian analysis.
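The propose-then-accept step just described can be sketched as below. This is a minimal illustration, not the algorithm as implemented in any reference: it assumes a symmetric normal random-walk proposal $q(y, z) = f(z - y)$, so that $\alpha(y, z) = \min\{p(z)/p(y), 1\}$, and it uses a hypothetical unnormalized target $p(y) \propto y^2 e^{-y}$ on $y > 0$ (a gamma density with mean 3) purely for checking. The names `log_target` and `random_walk_metropolis`, the step size, and the burn-in are assumptions.

```python
import numpy as np

def log_target(y):
    """Log of an assumed unnormalized target: p(y) proportional to y^2 exp(-y), y > 0."""
    return 2.0 * np.log(y) - y if y > 0 else -np.inf

def random_walk_metropolis(log_p, y0, m, step=2.0, burn_in=1000, seed=0):
    """Metropolis chain with a symmetric normal random-walk proposal."""
    rng = np.random.default_rng(seed)
    total = burn_in + m
    increments = rng.normal(0.0, step, size=total)   # z - y drawn from f
    log_unif = np.log(rng.uniform(size=total))       # log of U(0,1) draws
    y, log_py = y0, log_p(y0)
    chain = np.empty(m)
    for t in range(total):
        z = y + increments[t]                         # candidate from Q(y, .)
        log_pz = log_p(z)
        # q is symmetric, so alpha(y, z) = min{p(z)/p(y), 1};
        # only the ratio p(z)/p(y) is needed, never the normalizing constant
        if log_unif[t] < log_pz - log_py:
            y, log_py = z, log_pz                     # move to z
        if t >= burn_in:
            chain[t - burn_in] = y                    # otherwise stay at y
    return chain

chain = random_walk_metropolis(log_target, y0=1.0, m=50000)
est_mean = chain.mean()    # should be near 3, the mean of the assumed target
```

Working on the log scale avoids numerical underflow in the ratio; it is a design convenience and does not change the acceptance probability.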


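We now show that a Markov chain with a Metropolis kernel $P(y, A)$ is
$p$-invariant. First, by the definition of $p(y, z)$ and $\alpha(y, z)$,
$$p(y)p(y, z) = p(z)p(z, y)$$
for any $y$ and $z$. Then, for any $A \in \mathcal{B}_{\mathcal{Y}}$,
$$\begin{aligned}
\int P(y, A)p(y)\,d\nu &= \int \left[\int_A p(y, z)\,d\nu(z)\right] p(y)\,d\nu(y) + \int r(y)\delta_y(A)p(y)\,d\nu(y) \\
&= \int_A \left[\int p(y, z)p(y)\,d\nu(y)\right] d\nu(z) + \int_A r(y)p(y)\,d\nu(y) \\
&= \int_A \left[\int p(z, y)p(z)\,d\nu(y)\right] d\nu(z) + \int_A r(z)p(z)\,d\nu(z) \\
&= \int_A [1 - r(z)]p(z)\,d\nu(z) + \int_A r(z)p(z)\,d\nu(z) \\
&= \int_A p(z)\,d\nu(z).
\end{aligned}$$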


If a Markov chain with a Metropolis kernel defined by (4.15) is
$p$-irreducible and $\int_{r(y)>0} p(y)\,d\nu > 0$, then, by the results of Nummelin (1984,
§2.4), the chain is aperiodic; by Corollary 2 of Tierney (1994), the chain is
Harris recurrent. Hence, to apply Theorem 4.4 to a Markov chain with a
Metropolis kernel, it suffices to show that the chain is $p$-irreducible.


Lemma 4.2. Suppose that $Q(y, A)$ is the transition kernel of a $p$-irreducible
Markov chain and that either $q(y, z) > 0$ for all $y$ and $z$ or $q(y, z) = q(z, y)$
for all $y$ and $z$. Then the chain with the Metropolis kernel $P(y, A)$ in (4.15)
is $p$-irreducible.

Proof. It can be shown (exercise) that if $Q$ is any transition kernel of a
homogeneous Markov chain, then
$$Q^t(y, A) = \int_A \int \cdots \int \prod_{j=1}^t q(z_{t-j+1}, z_{t-j})\,d\nu(z_{t-j}), \tag{4.16}$$
where $z_t = y$, $y \in \mathcal{Y}$, and $A \in \mathcal{B}_{\mathcal{Y}}$. Let $y \in \mathcal{Y}$, $A \in \mathcal{B}_{\mathcal{Y}}$ with $\int_A p(z)\,d\nu > 0$,
and $B_y = \{z : \alpha(y, z) = 1\}$. If $\int_{A \cap B_y^c} p(z)\,d\nu > 0$, then
$$P(y, A) \geq \int_{A \cap B_y^c} q(y, z)\alpha(y, z)\,d\nu(z) = \int_{A \cap B_y^c} \frac{q(z, y)p(z)}{p(y)}\,d\nu(z) > 0,$$
which follows from either $q(z, y) > 0$ or $q(z, y) = q(y, z) > 0$ on $B_y^c$. If
$\int_{A \cap B_y^c} p(z)\,d\nu = 0$, then $\int_{A \cap B_y} p(z)\,d\nu > 0$. From the irreducibility of
$Q(y, A)$, there exists a $t \geq 1$ such that $Q^t(y, A \cap B_y) > 0$. Then, by
(4.15) and (4.16),
$$P^t(y, A) \geq P^t(y, A \cap B_y) \geq Q^t(y, A \cap B_y) > 0. \qquad ∎$$


Two examples of $q(y, z)$ given by Tierney (1994) are $q(y, z) = f(z - y)$
with a Lebesgue p.d.f. $f$ on $\mathcal{R}^k$, which corresponds to a random walk chain,
and $q(y, z) = f(z)$ with a p.d.f. $f$, which corresponds to an independence
chain and is closely related to the importance sampling discussed earlier.


Although MCMC methods have been used over the last 50 years,
research on the theory of MCMC is still very active. Important topics
include the choice of the transition kernel for MCMC; the rate of the
convergence in (4.13); the choice of the Monte Carlo size $m$; and the
estimation of the errors due to Monte Carlo. See more results and discussions
in Tierney (1994), Besag et al. (1995), Tanner (1996), and the references
therein.




4.2 Invariance 


The concept of invariance is introduced in §2.3.2 (Definition 2.9). In this 
section, we study the best invariant estimators and their properties in 
one-parameter location families (§4.2.1), in one-parameter scale families 
(§4.2.2), and in general location-scale families (§4.2.3). Note that invariant 
estimators are also called equivariant estimators. 


4.2.1 One-parameter location families 


Assume that the sample $X = (X_1,...,X_n)$ has a joint distribution $P_\mu$ with
a Lebesgue p.d.f.
$$f(x_1 - \mu, ..., x_n - \mu), \tag{4.17}$$
where $f$ is known and $\mu \in \mathcal{R}$ is an unknown location parameter. The family
$\mathcal{P} = \{P_\mu : \mu \in \mathcal{R}\}$ is called a one-parameter location family, a special case of
the general location-scale family described in Definition 2.3. It is invariant
under the location transformations $g_c(X) = (X_1 + c, ..., X_n + c)$, $c \in \mathcal{R}$.

We consider the estimation of $\mu$ as a statistical decision problem with
action space $A = \mathcal{R}$ and loss function $L(\mu, a)$. It is natural to consider
the same transformation in the action space, i.e., if $X_i$ is transformed to
$X_i + c$, then our action $a$ is transformed to $a + c$. Consequently, the decision
problem is invariant under location transformation if and only if
$$L(\mu, a) = L(\mu + c, a + c) \quad \text{for all } c \in \mathcal{R},$$
which is equivalent to
$$L(\mu, a) = L(a - \mu) \tag{4.18}$$
for a Borel function $L(\cdot)$ on $\mathcal{R}$.
According to Definition 2.9 (see also Example 2.24), an estimator $T$
(decision rule) of $\mu$ is location invariant if and only if
$$T(X_1 + c, ..., X_n + c) = T(X_1, ..., X_n) + c, \quad c \in \mathcal{R}. \tag{4.19}$$
Many estimators of $\mu$, such as the sample mean and weighted average of
the order statistics, are location invariant. The following result provides a
characterization of location invariant estimators.


Proposition 4.3. Let $T_0$ be a location invariant estimator of $\mu$. Let
$d_i = x_i - x_n$, $i = 1,...,n-1$, and $d = (d_1,...,d_{n-1})$. A necessary and
sufficient condition for an estimator $T$ to be location invariant is that there
exists a Borel function $u$ on $\mathcal{R}^{n-1}$ ($u \equiv$ a constant if $n = 1$) such that
$$T(x) = T_0(x) - u(d) \quad \text{for all } x \in \mathcal{R}^n. \tag{4.20}$$
Proof. It is easy to see that $T$ given by (4.20) satisfies (4.19) and, therefore,
is location invariant. Suppose that $T$ is location invariant. Let $\tilde{u}(x) =
T(x) - T_0(x)$ for any $x \in \mathcal{R}^n$. Then
$$\begin{aligned}
\tilde{u}(x_1 + c, ..., x_n + c) &= T(x_1 + c, ..., x_n + c) - T_0(x_1 + c, ..., x_n + c) \\
&= T(x_1, ..., x_n) - T_0(x_1, ..., x_n)
\end{aligned}$$
for all $c \in \mathcal{R}$ and $x_i \in \mathcal{R}$. Putting $c = -x_n$ leads to
$$\tilde{u}(x_1 - x_n, ..., x_{n-1} - x_n, 0) = T(x) - T_0(x), \quad x \in \mathcal{R}^n.$$
The result follows with $u(d_1,...,d_{n-1}) = -\tilde{u}(x_1 - x_n, ..., x_{n-1} - x_n, 0)$. ∎


Therefore, once we have a location invariant estimator $T_0$ of $\mu$, any
other location invariant estimator of $\mu$ can be constructed by taking the
difference between $T_0$ and a Borel function of the ancillary statistic $D =
(X_1 - X_n, ..., X_{n-1} - X_n)$.

The next result states an important property of location invariant esti- 
mators. 


Proposition 4.4. Let $X$ be distributed with the p.d.f. given by (4.17) and
let $T$ be a location invariant estimator of $\mu$ under the loss function given
by (4.18). If the bias, variance, and risk of $T$ are well defined, then they
are all constant (do not depend on $\mu$).

Proof. The result for the bias follows from
$$\begin{aligned}
b_T(\mu) &= \int T(x) f(x_1 - \mu, ..., x_n - \mu)\,dx - \mu \\
&= \int T(x_1 + \mu, ..., x_n + \mu) f(x)\,dx - \mu \\
&= \int [T(x) + \mu] f(x)\,dx - \mu \\
&= \int T(x) f(x)\,dx.
\end{aligned}$$
The proof of the result for variance or risk is left as an exercise. ∎


An important consequence of this result is that the problem of finding 
the best location invariant estimator reduces to comparing constants in- 
stead of risk functions. The following definition can be used not only for 
location invariant estimators, but also for general invariant estimators. 


Definition 4.2. Consider an invariant estimation problem in which all 
invariant estimators have constant risks. An invariant estimator T is called 
the minimum risk invariant estimator (MRIE) if and only if T has the 
smallest risk among all invariant estimators. 




Theorem 4.5. Let $X$ be distributed with the p.d.f. given by (4.17) and
consider the estimation of $\mu$ under the loss function given by (4.18).
Suppose that there is a location invariant estimator $T_0$ of $\mu$ with finite risk.
Let $D = (X_1 - X_n, ..., X_{n-1} - X_n)$.

(i) Assume that for each $d$ there exists a $u_*(d)$ that minimizes
$$h(d) = E_0[L(T_0(X) - u(d)) \mid D = d]$$
over all functions $u$, where the expectation $E_0$ is calculated under the
assumption that $X$ has p.d.f. $f(x_1,...,x_n)$. Then an MRIE exists and is given
by
$$T_*(X) = T_0(X) - u_*(D).$$
(ii) The function $u_*$ in (i) exists if $L(t)$ is convex and not monotone; it is
unique if $L$ is strictly convex.
(iii) If $T_0$ and $D$ are independent, then $u_*$ is a constant that minimizes
$E_0[L(T_0(X) - u)]$. If, in addition, the distribution of $T_0$ is symmetric about
$\mu$ and $L$ is convex and even, then $u_* = 0$.
Proof. By Theorem 1.7 and Propositions 4.3 and 4.4,
$$R_T(\mu) = E_0[h(D)],$$
where $T(X) = T_0(X) - u(D)$. This proves part (i). If $L$ is (strictly) convex
and not monotone, then $E_0[L(T_0(X) - a) \mid D = d]$ is (strictly) convex and not
monotone in $a$ (exercise). Hence $\lim_{|a| \to \infty} E_0[L(T_0(X) - a) \mid D = d] = \infty$.
This proves part (ii). The proof of part (iii) is left as an exercise. ∎


Theorem 4.6. Assume the conditions of Theorem 4.5 and that the loss is
the squared error loss.
(i) The unique MRIE of $\mu$ is
$$T_*(X) = \frac{\int_{-\infty}^{\infty} t f(X_1 - t, ..., X_n - t)\,dt}{\int_{-\infty}^{\infty} f(X_1 - t, ..., X_n - t)\,dt},$$
which is known as the Pitman estimator of $\mu$.
(ii) The MRIE of $\mu$ is unbiased.
Proof. (i) Under the squared error loss,
$$u_*(d) = E_0[T_0(X) \mid D = d] \tag{4.21}$$
(exercise). Let $T_0(X) = X_n$ (the $n$th observation). Then $X_n$ is location
invariant. If there exists a location invariant estimator of $\mu$ with finite risk,
then $E_0(X_n \mid D = d)$ is finite a.s. $\mathcal{P}$ (exercise). By Proposition 1.8, when
$\mu = 0$, the joint Lebesgue p.d.f. of $(D, X_n)$ is $f(d_1 + x_n, ..., d_{n-1} + x_n, x_n)$,
$d = (d_1,...,d_{n-1})$. The conditional p.d.f. of $X_n$ given $D = d$ is then
$$\frac{f(d_1 + x_n, ..., d_{n-1} + x_n, x_n)}{\int_{-\infty}^{\infty} f(d_1 + t, ..., d_{n-1} + t, t)\,dt}$$
(see (1.61)). By Proposition 1.9,
$$\begin{aligned}
E_0(X_n \mid D = d) &= \frac{\int_{-\infty}^{\infty} t f(d_1 + t, ..., d_{n-1} + t, t)\,dt}{\int_{-\infty}^{\infty} f(d_1 + t, ..., d_{n-1} + t, t)\,dt} \\
&= \frac{\int_{-\infty}^{\infty} t f(x_1 - x_n + t, ..., x_{n-1} - x_n + t, t)\,dt}{\int_{-\infty}^{\infty} f(x_1 - x_n + t, ..., x_{n-1} - x_n + t, t)\,dt} \\
&= x_n - \frac{\int_{-\infty}^{\infty} u f(x_1 - u, ..., x_n - u)\,du}{\int_{-\infty}^{\infty} f(x_1 - u, ..., x_n - u)\,du}
\end{aligned}$$
by letting $u = x_n - t$. The result in (i) follows from $T_*(X) = X_n - E_0(X_n \mid D)$
(Theorem 4.5).
(ii) Let $b$ be the constant bias of $T_*$ (Proposition 4.4). Then $T_1(X) =
T_*(X) - b$ is a location invariant estimator of $\mu$ and
$$R_{T_1} = E[T_*(X) - b - \mu]^2 = \mathrm{Var}(T_*) \leq \mathrm{Var}(T_*) + b^2 = R_{T_*}.$$
Since $T_*$ is the MRIE, $b = 0$, i.e., $T_*$ is unbiased. ∎


Theorem 4.6(ii) indicates that we only need to consider unbiased lo- 
cation invariant estimators in order to find the MRIE, if the loss is the 
squared error loss. In particular, a location invariant UMVUE is an MRIE. 


Example 4.11. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with an unknown
$\mu \in \mathcal{R}$ and a known $\sigma^2$. Note that $\bar{X}$ is location invariant. Since $\bar{X}$ is the
UMVUE of $\mu$ (§2.1), it is the MRIE under the squared error loss. Since the
distribution of $\bar{X}$ is symmetric about $\mu$ and $\bar{X}$ is independent of $D$ (Basu's
theorem), it follows from Theorem 4.5(iii) that $\bar{X}$ is an MRIE if $L$ is convex
and even. ∎


Example 4.12. Let $X_1,...,X_n$ be i.i.d. from the exponential distribution
$E(\mu, \theta)$, where $\theta$ is known and $\mu \in \mathcal{R}$ is unknown. Since $X_{(1)} - \theta/n$ is
location invariant and is the UMVUE of $\mu$, it is the MRIE under the squared
error loss. Note that $X_{(1)}$ is independent of $D$ (Basu's theorem). By
Theorem 4.5(iii), an MRIE is of the form $X_{(1)} - u_*$ with a constant $u_*$.
For the absolute error loss, $X_{(1)} - \theta \log 2/n$ is an MRIE (exercise). ∎


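Example 4.13. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution on
$(\mu - \frac{1}{2}, \mu + \frac{1}{2})$ with an unknown $\mu \in \mathcal{R}$. Consider the squared error loss.
Note that
$$f(x_1 - \mu, ..., x_n - \mu) = \begin{cases} 1 & \mu - \frac{1}{2} \leq x_{(1)} \leq x_{(n)} \leq \mu + \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
By Theorem 4.6(i), the MRIE of $\mu$ is
$$T_*(X) = \int_{X_{(n)} - \frac{1}{2}}^{X_{(1)} + \frac{1}{2}} t\,dt \bigg/ \int_{X_{(n)} - \frac{1}{2}}^{X_{(1)} + \frac{1}{2}} dt = \frac{X_{(1)} + X_{(n)}}{2}. \qquad ∎$$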
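The Pitman estimator of Theorem 4.6(i) can be checked numerically against the closed forms in Examples 4.11 and 4.13. The sketch below is an illustration only: it assumes i.i.d. observations, so that the joint p.d.f. is a product of marginal densities `f0`, and it replaces both integrals by Riemann sums on a common grid (the grid spacing cancels in the ratio). The function name `pitman_location`, the grid settings, and the data values are hypothetical.

```python
import numpy as np

def pitman_location(x, f0, pad=10.0, grid_size=200001):
    """Riemann-sum approximation of the Pitman estimator of a location
    parameter for i.i.d. data with marginal density f0 (at mu = 0)."""
    x = np.asarray(x, dtype=float)
    t = np.linspace(x.min() - pad, x.max() + pad, grid_size)
    # joint density f(x_1 - t, ..., x_n - t) evaluated on the grid of t
    joint = np.prod(f0(x[:, None] - t[None, :]), axis=0)
    return np.sum(t * joint) / np.sum(joint)

def uniform_pdf(u):
    """Density of the uniform distribution on (-1/2, 1/2)."""
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def normal_pdf(u):
    """Density of N(0, 1)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

x_unif = np.array([0.31, 0.62, 0.05, 0.48])       # made-up data with range < 1
x_norm = np.array([1.2, -0.4, 0.7, 2.1, 0.3])     # made-up data

t_unif = pitman_location(x_unif, uniform_pdf)     # compare with the midrange
t_norm = pitman_location(x_norm, normal_pdf)      # compare with the sample mean
midrange = (x_unif.min() + x_unif.max()) / 2.0
```

For the uniform case the approximation agrees with the midrange only up to the grid spacing, since the integrand is an indicator function.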


We end this section with a brief discussion of the admissibility of MRIE's
in a one-parameter location problem. Under the squared error loss, the
MRIE (Pitman's estimator) is admissible if there exists a location invariant
estimator $T_0$ with $E|T_0(X)|^3 < \infty$ (Stein, 1959). Under a general loss
function, an MRIE is admissible when it is a unique MRIE (under some
other minor conditions). See Farrell (1964), Brown (1966), and Brown and
Fox (1974) for further discussions.


4.2.2 One-parameter scale families 


Assume that the sample $X = (X_1,...,X_n)$ has a joint distribution $P_\sigma$ with
a Lebesgue p.d.f.
$$\frac{1}{\sigma^n} f\left(\frac{x_1}{\sigma}, ..., \frac{x_n}{\sigma}\right), \tag{4.22}$$
where $f$ is known and $\sigma > 0$ is an unknown scale parameter. The family
$\mathcal{P} = \{P_\sigma : \sigma > 0\}$ is called a one-parameter scale family and is a special
case of the general location-scale family in Definition 2.3. This family is
invariant under the scale transformations $g_r(X) = rX$, $r > 0$.

We consider the estimation of $\sigma^h$ with $A = [0, \infty)$, where $h$ is a nonzero
constant. The transformation $g_r$ induces the transformation $g_r(\sigma^h) = r^h\sigma^h$.
Hence, a loss function $L$ is scale invariant if and only if
$$L(r\sigma, r^h a) = L(\sigma, a) \quad \text{for all } r > 0,$$
which is equivalent to
$$L(\sigma, a) = L\left(\frac{a}{\sigma^h}\right) \tag{4.23}$$
for a Borel function $L(\cdot)$ on $[0, \infty)$. An example of a loss function satisfying
(4.23) is
$$L(\sigma, a) = \left|\frac{a}{\sigma^h} - 1\right|^p = \frac{|a - \sigma^h|^p}{\sigma^{ph}}, \tag{4.24}$$
where $p \geq 1$ is a constant. However, the squared error loss does not satisfy
(4.23).

An estimator $T$ of $\sigma^h$ is scale invariant if and only if
$$T(rX_1, ..., rX_n) = r^h T(X_1, ..., X_n).$$
Examples of scale invariant estimators are the sample variance $S^2$ (for $h =
2$), the sample standard deviation $S = \sqrt{S^2}$ (for $h = 1$), the sample range
$X_{(n)} - X_{(1)}$ (for $h = 1$), and the sample mean deviation $n^{-1}\sum_{i=1}^n |X_i - \bar{X}|$
(for $h = 1$).

The following result is an analogue of Proposition 4.3. Its proof is left 
as an exercise. 


Proposition 4.5. Let $T_0$ be a scale invariant estimator of $\sigma^h$. A necessary
and sufficient condition for an estimator $T$ to be scale invariant is that there
exists a positive Borel function $u$ on $\mathcal{R}^n$ such that
$$T(x) = T_0(x)/u(z) \quad \text{for all } x \in \mathcal{R}^n,$$
where $z = (z_1,...,z_n)$, $z_i = x_i/x_n$, $i = 1,...,n-1$, and $z_n = x_n/|x_n|$. ∎


The next result is similar to Proposition 4.4. It applies to any invariant 
problem defined in Definition 2.9. We use the notation in Definition 2.9. 


Theorem 4.7. Let $\mathcal{P}$ be a family invariant under $\mathcal{G}$ (a group of
transformations). Suppose that the loss function is invariant and $T$ is an invariant
decision rule. Then the risk function of $T$ is a constant. ∎


The proof is left as an exercise. Note that a special case of Theorem 4.7
is that any scale invariant estimator of $\sigma^h$ has a constant risk and, therefore,
an MRIE (Definition 4.2) of $\sigma^h$ usually exists. However, Proposition 4.4
is not a special case of Theorem 4.7, since the bias of a scale invariant
estimator may not be a constant in general. For example, the bias of the
sample standard deviation is a function of $\sigma$.


The next result and its proof are analogues of those of Theorem 4.5. 


Theorem 4.8. Let $X$ be distributed with the p.d.f. given by (4.22) and
consider the estimation of $\sigma^h$ under the loss function given by (4.23).
Suppose that there is a scale invariant estimator $T_0$ of $\sigma^h$ with finite risk. Let
$Z = (Z_1,...,Z_n)$ with $Z_i = X_i/X_n$, $i = 1,...,n-1$, and $Z_n = X_n/|X_n|$.

(i) Assume that for each $z$ there exists a $u_*(z)$ that minimizes
$$E_1[L(T_0(X)/u(z)) \mid Z = z]$$
over all positive Borel functions $u$, where the conditional expectation $E_1$ is
calculated under the assumption that $X$ has p.d.f. $f(x_1,...,x_n)$. Then, an
MRIE exists and is given by
$$T_*(X) = T_0(X)/u_*(Z).$$
(ii) The function $u_*$ in (i) exists if $\gamma(t) = L(e^t)$ is convex and not monotone;
it is unique if $\gamma(t)$ is strictly convex.




The loss function given by (4.24) satisfies the condition in Theorem 
4.8(ii). A loss function corresponding to the squared error loss in this 
problem is the loss function (4.24) with p = 2. We have the following result 
similar to Theorem 4.6 (its proof is left as an exercise). 


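Corollary 4.1. Under the conditions of Theorem 4.8 and the loss function
(4.24) with $p = 2$, the unique MRIE of $\sigma^h$ is
$$T_*(X) = \frac{T_0(X)E_1[T_0(X) \mid Z]}{E_1\{[T_0(X)]^2 \mid Z\}} = \frac{\int_0^\infty t^{n+h-1} f(tX_1, ..., tX_n)\,dt}{\int_0^\infty t^{n+2h-1} f(tX_1, ..., tX_n)\,dt},$$
which is known as the Pitman estimator of $\sigma^h$. ∎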


Example 4.14. Let $X_1,...,X_n$ be i.i.d. from $N(0, \sigma^2)$ and consider the
estimation of $\sigma^2$. Then $T_0 = \sum_{i=1}^n X_i^2$ is scale invariant. By Basu's theorem,
$T_0$ is independent of $Z$. Hence $u_*$ in Theorem 4.8 is a constant minimizing
$E_1[L(T_0/u)]$ over $u > 0$. When the loss is given by (4.24) with $p = 2$, by
Corollary 4.1, the MRIE (Pitman's estimator) is
$$T_*(X) = \frac{T_0(X)E_1[T_0(X)]}{E_1[T_0(X)]^2} = \frac{1}{n+2}\sum_{i=1}^n X_i^2,$$
since $T_0$ has the chi-square distribution $\chi_n^2$ when $\sigma = 1$. Note that the
UMVUE of $\sigma^2$ is $T_0/n$, which is different from the MRIE. ∎
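The comparison between the MRIE $T_0/(n+2)$ and the UMVUE $T_0/n$ can be made concrete with a short calculation. The sketch below is an illustration, not part of the original example: it uses the standard chi-square moments $E_1(T_0) = n$ and $E_1(T_0^2) = n(n+2)$ to write the risk of $cT_0$ under the loss (4.24) with $p = 2$ and $h = 2$ as $c^2 n(n+2) - 2cn + 1$. The function name and the choice $n = 10$ are hypothetical.

```python
def risk_scaled_sum_of_squares(c, n):
    """Risk of c * sum(X_i^2) under the loss (a / sigma^2 - 1)^2 for
    N(0, sigma^2) data, using E(T0) = n and E(T0^2) = n(n + 2) at sigma = 1."""
    return c * c * n * (n + 2) - 2.0 * c * n + 1.0

n = 10
risk_mrie = risk_scaled_sum_of_squares(1.0 / (n + 2), n)   # equals 2 / (n + 2)
risk_umvue = risk_scaled_sum_of_squares(1.0 / n, n)        # equals 2 / n

# crude grid search over c in (0, 0.5] to confirm where the risk is smallest
best_c = min((k / 1000 for k in range(1, 501)),
             key=lambda c: risk_scaled_sum_of_squares(c, n))
```

Because the risk is quadratic in $c$, its minimizer is $E_1(T_0)/E_1(T_0^2) = 1/(n+2)$, which is the multiplier appearing in Corollary 4.1.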


Example 4.15. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution on
$(0, \sigma)$ and consider the estimation of $\sigma$. By Basu's theorem, the scale
invariant estimator $X_{(n)}$ is independent of $Z$. Hence $u_*$ in Theorem 4.8 is a
constant minimizing $E_1[L(X_{(n)}/u)]$ over $u > 0$. When the loss is given by
(4.24) with $p = 2$, by Corollary 4.1, the MRIE (Pitman's estimator) is
$$T_*(X) = \frac{X_{(n)}E_1 X_{(n)}}{E_1 X_{(n)}^2} = \frac{(n+2)X_{(n)}}{n+1}. \qquad ∎$$


4.2.3. General location-scale families 


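Assume that $X = (X_1,...,X_n)$ has a joint distribution $P_\theta$ with a Lebesgue
p.d.f.
$$\frac{1}{\sigma^n} f\left(\frac{x_1 - \mu}{\sigma}, ..., \frac{x_n - \mu}{\sigma}\right), \tag{4.25}$$
where $f$ is known, $\theta = (\mu, \sigma) \in \Theta$, and $\Theta = \mathcal{R} \times (0, \infty)$. The family
$\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a location-scale family defined by Definition 2.3 and
is invariant under the location-scale transformations of the form $g_{c,r}(X) =
(rX_1 + c, ..., rX_n + c)$, $c \in \mathcal{R}$, $r > 0$, which induce similar transformations
on $\Theta$: $g_{c,r}(\theta) = (r\mu + c, r\sigma)$, $c \in \mathcal{R}$, $r > 0$.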




Consider the estimation of $\sigma^h$ with a fixed $h \neq 0$ under the loss function
(4.23), which is invariant under the location-scale transformations $g_{c,r}$. An
estimator $T$ of $\sigma^h$ is location-scale invariant if and only if
$$T(rX_1 + c, ..., rX_n + c) = r^h T(X_1, ..., X_n). \tag{4.26}$$
By Theorem 4.7, any location-scale invariant $T$ has a constant risk. Letting
$r = 1$ in (4.26), we obtain that
$$T(X_1 + c, ..., X_n + c) = T(X_1, ..., X_n)$$
for all $c \in \mathcal{R}$. Therefore, $T$ is a function of $D = (D_1,...,D_{n-1})$, $D_i =
X_i - X_n$, $i = 1,...,n-1$. From (4.25), the joint Lebesgue p.d.f. of $D$ is
$$\frac{1}{\sigma^{n-1}} \int_{-\infty}^{\infty} f\left(\frac{d_1}{\sigma} + t, ..., \frac{d_{n-1}}{\sigma} + t, t\right) dt, \tag{4.27}$$
which is of the form (4.22) with $n$ replaced by $n-1$ and $x_i$'s replaced by $d_i$'s.
It follows from Theorem 4.8 that if $T_0(D)$ is any finite risk scale invariant
estimator of $\sigma^h$ based on $D$, then an MRIE of $\sigma^h$ is
$$T_*(D) = T_0(D)/u_*(W), \tag{4.28}$$
where $W = (W_1,...,W_{n-1})$, $W_i = D_i/D_{n-1}$, $i = 1,...,n-2$, $W_{n-1} =
D_{n-1}/|D_{n-1}|$, $u_*(w)$ is any number minimizing $E_1[L(T_0(D)/u(w)) \mid W = w]$
over all positive Borel functions $u$, and $E_1$ is the conditional expectation
calculated under the assumption that $D$ has p.d.f. (4.27) with $\sigma = 1$.


Consider next the estimation of $\mu$. Under the location-scale
transformation $g_{c,r}$, it can be shown (exercise) that a loss function is invariant if
and only if it is of the form
$$L\left(\frac{a - \mu}{\sigma}\right). \tag{4.29}$$
An estimator $T$ of $\mu$ is location-scale invariant if and only if
$$T(rX_1 + c, ..., rX_n + c) = rT(X_1, ..., X_n) + c.$$
Again, by Theorem 4.7, the risk of an invariant $T$ is a constant.


The following result is an analogue of Proposition 4.3 or 4.5.


Proposition 4.6. Let $T_0$ be any estimator of $\mu$ invariant under
location-scale transformation and let $T_1$ be any estimator of $\sigma$ satisfying (4.26) with
$h = 1$ and $T_1 > 0$. Then an estimator $T$ of $\mu$ is location-scale invariant if
and only if there is a Borel function $u$ on $\mathcal{R}^{n-1}$ such that
$$T(X) = T_0(X) - u(W)T_1(X),$$
where $W$ is given in (4.28). ∎




The proofs of Proposition 4.6 and the next result, an analogue of The- 
orem 4.5 or 4.8, are left as exercises. 


Theorem 4.9. Let $X$ be distributed with p.d.f. given by (4.25) and
consider the estimation of $\mu$ under the loss function given by (4.29). Suppose
that there is a location-scale invariant estimator $T_0$ of $\mu$ with finite risk.
Let $T_1$ be given in Proposition 4.6. Then an MRIE of $\mu$ is
$$T_*(X) = T_0(X) - u_*(W)T_1(X),$$
where $W$ is given in (4.28), $u_*(w)$ is any number minimizing
$$E_{0,1}[L(T_0(X) - u(w)T_1(X)) \mid W = w]$$
over all Borel functions $u$, and $E_{0,1}$ is computed under the assumption that
$X$ has the p.d.f. (4.25) with $\mu = 0$ and $\sigma = 1$. ∎


Corollary 4.2. Under the conditions of Theorem 4.9 and the loss function
$(a - \mu)^2/\sigma^2$, $u_*(w)$ in Theorem 4.9 is equal to
$$u_*(w) = \frac{E_{0,1}[T_0(X)T_1(X) \mid W = w]}{E_{0,1}\{[T_1(X)]^2 \mid W = w\}}. \qquad ∎$$


Example 4.16. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \sigma^2)$, where $\mu \in \mathcal{R}$ and
$\sigma^2 > 0$ are unknown. Consider first the estimation of $\sigma^2$ under loss function
(4.23). The sample variance $S^2$ is location-scale invariant and is
independent of $W$ in (4.28) (Basu's theorem). Thus, by (4.28), $S^2/u_*$ is an MRIE,
where $u_*$ is a constant minimizing $E_1[L(S^2/u)]$ over all $u > 0$. If the loss
function is given by (4.24) with $p = 2$, then by Corollary 4.1, the MRIE of
$\sigma^2$ is
$$T_*(X) = \frac{S^2 E_1(S^2)}{E_1(S^2)^2} = \frac{S^2}{(n^2 - 1)/(n - 1)^2} = \frac{1}{n+1}\sum_{i=1}^n (X_i - \bar{X})^2,$$
since $(n-1)S^2$ has a chi-square distribution $\chi_{n-1}^2$ when $\sigma = 1$.
Next, consider the estimation of $\mu$ under the loss function (4.29). Since
$\bar{X}$ is a location-scale invariant estimator of $\mu$ and is independent of $W$ in
(4.28) (Basu's theorem), by Theorem 4.9 with $T_1 = S$ (the sample standard
deviation, which satisfies (4.26) with $h = 1$), an MRIE of $\mu$ is
$$T_*(X) = \bar{X} - u_* S,$$
where $u_*$ is a constant. If $L$ in (4.29) is convex and even, then $u_* = 0$ (see
Theorem 4.5(iii)) and, hence, $\bar{X}$ is an MRIE of $\mu$. ∎


Example 4.17. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution on
$(\mu - \frac{1}{2}\sigma, \mu + \frac{1}{2}\sigma)$, where $\mu \in \mathcal{R}$ and $\sigma > 0$ are unknown. Consider first the
estimation of $\sigma$ under the loss function (4.24) with $p = 2$. The sample range
$X_{(n)} - X_{(1)}$ is a location-scale invariant estimator of $\sigma$ and is independent
of $W$ in (4.28) (Basu's theorem). By (4.28) and Corollary 4.1, the MRIE
of $\sigma$ is
$$T_*(X) = \frac{(X_{(n)} - X_{(1)})E_1(X_{(n)} - X_{(1)})}{E_1(X_{(n)} - X_{(1)})^2} = \frac{(n+2)(X_{(n)} - X_{(1)})}{n}.$$
Consider now the estimation of $\mu$ under the loss function (4.29). Since
$(X_{(1)} + X_{(n)})/2$ is a location-scale invariant estimator of $\mu$ and is
independent of $W$ in (4.28) (Basu's theorem), by Theorem 4.9, an MRIE of $\mu$
is
$$T_*(X) = \frac{X_{(1)} + X_{(n)}}{2} - u_*(X_{(n)} - X_{(1)}),$$
where $u_*$ is a constant. If $L$ in (4.29) is convex and even, then $u_* = 0$ (see
Theorem 4.5(iii)) and, hence, $(X_{(1)} + X_{(n)})/2$ is an MRIE of $\mu$. ∎


Finding MRIE's in various location-scale families under transformations
$AX + c$, where $A \in \mathcal{T}$ and $c \in \mathcal{C}$ with given $\mathcal{T}$ and $\mathcal{C}$, can be done in a similar
way. We only provide some brief discussions for two important cases. The
first case is the two-sample location-scale problem in which two samples,
$X = (X_1,...,X_m)$ and $Y = (Y_1,...,Y_n)$, are taken from a distribution with
Lebesgue p.d.f.
$$\frac{1}{\sigma_x^m \sigma_y^n} f\left(\frac{x_1 - \mu_x}{\sigma_x}, ..., \frac{x_m - \mu_x}{\sigma_x}, \frac{y_1 - \mu_y}{\sigma_y}, ..., \frac{y_n - \mu_y}{\sigma_y}\right), \tag{4.30}$$
where $f$ is known, $\mu_x \in \mathcal{R}$ and $\mu_y \in \mathcal{R}$ are unknown location parameters,
and $\sigma_x > 0$ and $\sigma_y > 0$ are unknown scale parameters. The family of
distributions is invariant under the transformations
$$g(X, Y) = (rX_1 + c, ..., rX_m + c, r'Y_1 + c', ..., r'Y_n + c'), \tag{4.31}$$
where $r > 0$, $r' > 0$, $c \in \mathcal{R}$, and $c' \in \mathcal{R}$. The parameters to be estimated
in this problem are usually $\Delta = \mu_y - \mu_x$ and $\eta = (\sigma_y/\sigma_x)^h$ with a fixed
$h \neq 0$. If $X$ and $Y$ are from two populations, $\Delta$ and $\eta$ are measures of the
difference between the two populations. For estimating $\eta$, results similar to
those in this section can be established. For estimating $\Delta$, MRIE's can be
obtained under some conditions. See Exercises 63-65.


The second case is the general linear model (3.25) under the assumption
that $\varepsilon_i$'s are i.i.d. with the p.d.f. $\sigma^{-1} f(x/\sigma)$, where $f$ is a known Lebesgue
p.d.f. The family of populations is invariant under the transformations
$$g(X) = rX + Zc, \quad r \in (0, \infty),\ c \in \mathcal{R}^p \tag{4.32}$$
(exercise). The estimation of $l^\tau\beta$ with $l \in \mathcal{R}(Z)$ is invariant under the
loss function $L\left(\frac{a - l^\tau\beta}{\sigma}\right)$ and the LSE $l^\tau\hat{\beta}$ is an invariant estimator of $l^\tau\beta$
(exercise). When $f$ is normal, the following result can be established using
an argument similar to that in Example 4.16.


Theorem 4.10. Consider model (3.25) with assumption A1.
(i) Under transformations (4.32) and the loss function $L\left(\frac{a - l^\tau\beta}{\sigma}\right)$, where $L$
is convex and even, the LSE $l^\tau\hat{\beta}$ is an MRIE of $l^\tau\beta$ for any $l \in \mathcal{R}(Z)$.

(ii) Under transformations (4.32) and the loss function $(a - \sigma^2)^2/\sigma^4$, the
MRIE of $\sigma^2$ is $SSR/(n - r + 2)$, where $SSR$ is given by (3.35) and $r$ is the
rank of $Z$. ∎


MRIE's in a parametric family with a multi-dimensional $\theta$ are often
inadmissible. See Lehmann (1983, p. 285) for more discussions.


4.3. Minimaxity and Admissibility 


Consider the estimation of a real-valued $\vartheta = g(\theta)$ based on a sample $X$ from
$P_\theta$, $\theta \in \Theta$, under a given loss function. A minimax estimator minimizes the
maximum risk $\sup_{\theta \in \Theta} R_T(\theta)$ over all estimators $T$ (see §2.3.2).


A unique minimax estimator is admissible, since any estimator better
than a minimax estimator is also minimax. This indicates that we should
consider minimaxity and admissibility together. The situation is different
for a UMVUE (or an MRIE), since if a UMVUE (or an MRIE) is
inadmissible, it is dominated by an estimator that is not unbiased (or invariant).


4.3.1 Estimators with constant risks 


By minimizing the maximum risk, a minimax estimator tries to do as well 
as possible in the worst case. Such an estimator can be very unsatisfactory. 
However, if a minimax estimator has some other good properties (e.g., it is 
a Bayes estimator), then it is often a reasonable estimator. Here we study 
when estimators having constant risks (e.g., MRIE’s) are minimax. 


Theorem 4.11. Let $\Pi$ be a proper prior on $\Theta$ and $\delta$ be a Bayes estimator
of $\vartheta$ w.r.t. $\Pi$. Let $\Theta_\Pi = \{\theta : R_\delta(\theta) = \sup_{\theta \in \Theta} R_\delta(\theta)\}$. If $\Pi(\Theta_\Pi) = 1$, then $\delta$
is minimax. If, in addition, $\delta$ is the unique Bayes estimator w.r.t. $\Pi$, then
it is the unique minimax estimator.

Proof. Let $T$ be any other estimator of $\vartheta$. Then
$$\sup_{\theta \in \Theta} R_T(\theta) \geq \int_{\Theta_\Pi} R_T(\theta)\,d\Pi \geq \int_{\Theta_\Pi} R_\delta(\theta)\,d\Pi = \sup_{\theta \in \Theta} R_\delta(\theta).$$
If $\delta$ is the unique Bayes estimator, then the second inequality in the previous
expression should be replaced by $>$ and, therefore, $\delta$ is the unique minimax
estimator. ∎


The condition of Theorem 4.11 essentially means that 6 has a constant 
risk. Thus, a Bayes estimator having constant risk is minimax. 


Example 4.18. Let $X_1,...,X_n$ be i.i.d. binary random variables with
$P(X_1 = 1) = p \in (0, 1)$. Consider the estimation of $p$ under the squared
error loss. The UMVUE $\bar{X}$ has risk $p(1-p)/n$, which is not constant. In fact,
$\bar{X}$ is not minimax (Exercise 67). To find a minimax estimator by applying
Theorem 4.11, we consider the Bayes estimator w.r.t. the beta distribution
$B(\alpha, \beta)$ with known $\alpha$ and $\beta$ (Exercise 1):
$$\delta(X) = (\alpha + n\bar{X})/(\alpha + \beta + n).$$
A straightforward calculation shows that
$$R_\delta(p) = [np(1-p) + (\alpha - \alpha p - \beta p)^2]/(\alpha + \beta + n)^2.$$
To apply Theorem 4.11, we need to find values of $\alpha > 0$ and $\beta > 0$ such
that $R_\delta(p)$ is constant. It can be shown that $R_\delta(p)$ is constant if and only
if $\alpha = \beta = \sqrt{n}/2$, which leads to the unique minimax estimator
$$T(X) = (n\bar{X} + \sqrt{n}/2)/(n + \sqrt{n}).$$
The risk of $T$ is $R_T = 1/[4(1 + \sqrt{n})^2]$.
Note that $T$ is a Bayes estimator and has some good properties.
Comparing the risk of $T$ with that of $\bar{X}$, we find that $T$ has smaller risk if and
only if
$$p \in \left(\frac{1}{2} - \frac{\sqrt{1 + 2\sqrt{n}}}{2(1 + \sqrt{n})},\ \frac{1}{2} + \frac{\sqrt{1 + 2\sqrt{n}}}{2(1 + \sqrt{n})}\right). \tag{4.33}$$
Thus, for a small $n$, $T$ is better (and can be much better) than $\bar{X}$ for most
of the range of $p$ (Figure 4.1). When $n \to \infty$, the interval in (4.33) shrinks
toward $\frac{1}{2}$. Hence, for a large (and even moderate) $n$, $\bar{X}$ is better than $T$
for most of the range of $p$ (Figure 4.1). The limit of the asymptotic relative
efficiency of $T$ w.r.t. $\bar{X}$ is $4p(1-p)$, which is always smaller than 1 when
$p \neq \frac{1}{2}$ and equals 1 when $p = \frac{1}{2}$.

The minimax estimator depends strongly on the loss function. To see
this, let us consider the loss function $L(p, a) = (a - p)^2/[p(1-p)]$. Under this
loss function, $\bar{X}$ has constant risk and is the unique Bayes estimator w.r.t.
the uniform prior on $(0, 1)$. By Theorem 4.11, $\bar{X}$ is the unique minimax
estimator. On the other hand, the risk of $T$ is equal to $1/[4(1 + \sqrt{n})^2 p(1-p)]$,
which is unbounded. ∎
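The risk formulas in Example 4.18 are easy to evaluate directly. The following sketch, with hypothetical function names and the illustrative choice $n = 16$, computes the squared error risks of $\bar{X}$ and of the Bayes estimator $\delta$, checks that the risk is constant when $\alpha = \beta = \sqrt{n}/2$, and evaluates the end points of the interval (4.33).

```python
import math

def risk_sample_mean(p, n):
    """Squared error risk of the sample mean: p(1 - p)/n."""
    return p * (1.0 - p) / n

def risk_bayes(p, n, a, b):
    """Squared error risk of (a + n * Xbar)/(a + b + n) under a B(a, b) prior."""
    return (n * p * (1.0 - p) + (a - a * p - b * p) ** 2) / (a + b + n) ** 2

def minimax_better_interval(n):
    """End points of the interval (4.33) on which T has smaller risk than Xbar."""
    half = math.sqrt(1.0 + 2.0 * math.sqrt(n)) / (2.0 * (1.0 + math.sqrt(n)))
    return 0.5 - half, 0.5 + half

n = 16
a = b = math.sqrt(n) / 2.0                                   # alpha = beta = sqrt(n)/2
risks_T = [risk_bayes(p, n, a, b) for p in (0.1, 0.5, 0.9)]  # constant in p
constant_risk = 1.0 / (4.0 * (1.0 + math.sqrt(n)) ** 2)      # 1/[4(1 + sqrt(n))^2]
lower, upper = minimax_better_interval(n)                    # (0.2, 0.8) for n = 16
```

At the end points of (4.33) the two risks coincide, which gives a quick consistency check on the interval.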


Figure 4.1: mse's of $\bar{X}$ (curve) and $T(X)$ (straight line) in Example 4.18
(vertical axis: mse; recoverable panel titles: $n = 9$, $n = 16$)


In many cases a constant risk estimator is not a Bayes estimator (e.g., 
an unbiased estimator under the squared error loss), but a limit of Bayes 
estimators w.r.t. a sequence of priors. Then the following result may be 
used to find a minimax estimator. 


Theorem 4.12. Let II,;, 7 = 1,2,..., be a sequence of priors and r; be the 
Bayes risk of a Bayes estimator of  w.r.t. Ij. Let J’ be a constant risk 
estimator of v. If liminf;r; > Rr, then T is minimax. I 


The proof of this theorem is similar to that of Theorem 4.11. Although
Theorem 4.12 is more general than Theorem 4.11 in finding minimax
estimators, it does not provide uniqueness of the minimax estimator even when
there is a unique Bayes estimator w.r.t. each $\Pi_j$.


In Example 2.25, we actually applied the result in Theorem 4.12 to show
the minimaxity of $\bar{X}$ as an estimator of $\mu = EX_1$ when $X_1,...,X_n$ are i.i.d.
from a normal distribution with a known $\sigma^2 = \mathrm{Var}(X_1)$, under the squared
error loss. To discuss the minimaxity of $\bar{X}$ in the case where $\sigma^2$ is unknown,
we need the following lemma.


264 4, Estimation in Parametric Models 


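Lemma 4.3. Let $\Theta_0$ be a subset of $\Theta$ and $T$ be a minimax estimator of $\vartheta$
when $\Theta_0$ is the parameter space. Then $T$ is a minimax estimator if
$$\sup_{\theta\in\Theta} R_T(\theta) = \sup_{\theta\in\Theta_0} R_T(\theta).$$
Proof. If there is an estimator $T_0$ with $\sup_{\theta\in\Theta} R_{T_0}(\theta) < \sup_{\theta\in\Theta} R_T(\theta)$,
then
$$\sup_{\theta\in\Theta_0} R_{T_0}(\theta) \le \sup_{\theta\in\Theta} R_{T_0}(\theta) < \sup_{\theta\in\Theta} R_T(\theta) = \sup_{\theta\in\Theta_0} R_T(\theta),$$
which contradicts the minimaxity of $T$ when $\Theta_0$ is the parameter space.
Hence, $T$ is minimax when $\Theta$ is the parameter space. ∎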


Example 4.19. Let $X_1,...,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with unknown $\theta =
(\mu,\sigma^2)$. Consider the estimation of $\mu$ under the squared error loss. Suppose
first that $\Theta = \mathcal{R} \times (0,c]$ with a constant $c > 0$. Let $\Theta_0 = \mathcal{R} \times \{c\}$. From
Example 2.25, $\bar{X}$ is a minimax estimator of $\mu$ when the parameter space
is $\Theta_0$. An application of Lemma 4.3 shows that $\bar{X}$ is also minimax when
the parameter space is $\Theta$. Although $\sigma^2$ is assumed to be bounded by $c$, the
minimax estimator $\bar{X}$ does not depend on $c$.

Consider next the case where $\Theta = \mathcal{R} \times (0,\infty)$, i.e., $\sigma^2$ is unbounded.
Let $T$ be any estimator of $\mu$. For any fixed $\sigma^2$,

$$\frac{\sigma^2}{n} \le \sup_{\mu\in\mathcal{R}} R_T(\theta),$$

since $\sigma^2/n$ is the risk of $\bar{X}$ that is minimax when $\sigma^2$ is known (Example
2.25). Letting $\sigma^2 \to \infty$, we obtain that $\sup_\theta R_T(\theta) = \infty$ for any estimator
$T$. Thus, minimaxity is meaningless (any estimator is minimax). ∎


Theorem 4.13. Suppose that $T$ as an estimator of $\vartheta$ has constant risk and
is admissible. Then $T$ is minimax. If the loss function is strictly convex,
then $T$ is the unique minimax estimator.

Proof. By the admissibility of $T$, if there is another estimator $T_0$ with
$\sup_\theta R_{T_0}(\theta) \le R_T$, then $R_{T_0}(\theta) = R_T$ for all $\theta$. This proves that $T$ is
minimax. If the loss function is strictly convex and $T_0$ is another minimax
estimator, then

$$R_{(T+T_0)/2}(\theta) < (R_{T_0} + R_T)/2 = R_T$$

for all $\theta$ and, therefore, $T$ is inadmissible, which contradicts the assumption.
This shows that $T$ is unique if the loss is strictly convex. ∎


Combined with Theorem 4.7, Theorem 4.13 tells us that if an MRIE is 
admissible, then it is minimax. From the discussion at the end of §4.2.1, 
MRIE’s in one-parameter location families (such as Pitman’s estimators) 
are usually minimax. 




4.3.2 Results in one-parameter exponential families 


The following result provides a sufficient condition for the admissibility of
a class of estimators when the population $P_\theta$ is in a one-parameter
exponential family. Using this result and Theorem 4.13, we can obtain a class
of minimax estimators. The proof of this result is an application of the
information inequality introduced in §3.1.3.


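Theorem 4.14. Suppose that $X$ has the p.d.f. $c(\theta)e^{\theta T(x)}$ w.r.t. a $\sigma$-finite
measure $\nu$, where $T(x)$ is real-valued and $\theta \in (\theta_-,\theta_+) \subset \mathcal{R}$. Consider the
estimation of $\vartheta = E[T(X)]$ under the squared error loss. Let $\lambda \ge 0$ and $\gamma$
be known constants and let $T_{\lambda,\gamma}(X) = (T + \gamma\lambda)/(1+\lambda)$. Then a sufficient
condition for the admissibility of $T_{\lambda,\gamma}$ is that

$$\int_{\theta_0}^{\theta_+} \frac{e^{-\gamma\lambda\theta}}{[c(\theta)]^\lambda}\, d\theta = \int_{\theta_-}^{\theta_0} \frac{e^{-\gamma\lambda\theta}}{[c(\theta)]^\lambda}\, d\theta = \infty,$$ (4.34)

where $\theta_0 \in (\theta_-,\theta_+)$.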
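Proof. From Theorem 2.1, $\vartheta = E[T(X)] = -c'(\theta)/c(\theta)$ and $d\vartheta/d\theta = \mathrm{Var}(T) =
I(\theta)$, the Fisher information defined in (3.5). Suppose that there is an
estimator $\delta$ of $\vartheta$ such that for all $\theta$,

$$R_\delta(\theta) \le R_{T_{\lambda,\gamma}}(\theta) = [I(\theta) + \lambda^2(\vartheta - \gamma)^2]/(1+\lambda)^2.$$

Let $b_\delta(\theta)$ be the bias of $\delta$. From the information inequality (3.6),

$$R_\delta(\theta) \ge [b_\delta(\theta)]^2 + [I(\theta) + b_\delta'(\theta)]^2/I(\theta).$$

Let $h(\theta) = b_\delta(\theta) - \lambda(\gamma - \vartheta)/(1+\lambda)$. Then

$$[h(\theta)]^2 - \frac{2\lambda h(\theta)(\vartheta - \gamma)}{1+\lambda} + \frac{2h'(\theta)}{1+\lambda} + \frac{[h'(\theta)]^2}{I(\theta)} \le 0,$$

which implies

$$[h(\theta)]^2 - \frac{2\lambda h(\theta)(\vartheta - \gamma)}{1+\lambda} + \frac{2h'(\theta)}{1+\lambda} \le 0.$$ (4.35)

Let $a(\theta) = h(\theta)[c(\theta)]^\lambda e^{\gamma\lambda\theta}$. Differentiation of $a(\theta)$ reduces (4.35) to

$$\frac{[a(\theta)]^2 e^{-\gamma\lambda\theta}}{[c(\theta)]^\lambda} + \frac{2a'(\theta)}{1+\lambda} \le 0.$$ (4.36)

Suppose that $a(\theta_0) < 0$ for some $\theta_0 \in (\theta_-,\theta_+)$. From (4.36), $a'(\theta) \le 0$ for
all $\theta$. Hence $a(\theta) < 0$ for all $\theta \ge \theta_0$ and, for $\theta > \theta_0$, (4.36) can be written
as

$$\frac{d}{d\theta}\left[\frac{1}{a(\theta)}\right] \ge \frac{(1+\lambda)e^{-\gamma\lambda\theta}}{2[c(\theta)]^\lambda}.$$

Integrating both sides from $\theta_0$ to $\theta$ gives

$$\frac{1+\lambda}{2}\int_{\theta_0}^{\theta} \frac{e^{-\gamma\lambda\theta}}{[c(\theta)]^\lambda}\, d\theta \le \frac{1}{a(\theta)} - \frac{1}{a(\theta_0)} \le -\frac{1}{a(\theta_0)}.$$

Letting $\theta \to \theta_+$, the left-hand side of the previous expression diverges to $\infty$
by condition (4.34), which is impossible. This shows that $a(\theta) \ge 0$ for all $\theta$.
Similarly, we can show that $a(\theta) \le 0$ for all $\theta$. Thus, $a(\theta) = 0$ for all $\theta$. This
means that $h(\theta) = 0$ for all $\theta$ and $b_\delta'(\theta) = -\lambda\vartheta'/(1+\lambda) = -\lambda I(\theta)/(1+\lambda)$,
which implies $R_\delta(\theta) \equiv R_{T_{\lambda,\gamma}}(\theta)$. This proves the admissibility of $T_{\lambda,\gamma}$.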


The reason why $T_{\lambda,\gamma}$ is considered is that it is often a Bayes estimator
w.r.t. some prior; see, for example, Examples 2.25, 4.1, 4.7, and 4.8. To
find minimax estimators, we may use the following result.


Corollary 4.3. Assume that $X$ has the p.d.f. as described in Theorem
4.14 with $\theta_- = -\infty$ and $\theta_+ = \infty$.

(i) As an estimator of $\vartheta = E(T)$, $T(X)$ is admissible under the squared
error loss and the loss $(a - \vartheta)^2/\mathrm{Var}(T)$.

(ii) $T$ is the unique minimax estimator of $\vartheta$ under the loss $(a - \vartheta)^2/\mathrm{Var}(T)$.
Proof. (i) With $\lambda = 0$, condition (4.34) is clearly satisfied. Hence, Theorem
4.14 applies under the squared error loss. The admissibility of $T$ under the
loss $(a - \vartheta)^2/\mathrm{Var}(T)$ follows from the fact that $T$ is admissible under the
squared error loss and $\mathrm{Var}(T) \ne 0$.

(ii) This is a consequence of part (i) and Theorem 4.13. ∎


Example 4.20. Let $X_1,...,X_n$ be i.i.d. from $N(0,\sigma^2)$ with an unknown
$\sigma^2 > 0$. Let $Y = \sum_{i=1}^n X_i^2$. From Example 4.14, $Y/(n+2)$ is the MRIE of $\sigma^2$
and has constant risk under the loss $(a - \sigma^2)^2/\sigma^4$. We now apply Theorem
4.14 to show that $Y/(n+2)$ is admissible. Note that the joint p.d.f. of $X_i$'s
is of the form $c(\theta)e^{\theta T(x)}$ with $\theta = -n/(4\sigma^2)$, $c(\theta) = (-2\theta/n)^{n/2}$, $T(X) =
2Y/n$, $\theta_- = -\infty$, and $\theta_+ = 0$. By Theorem 4.14, $T_{\lambda,\gamma} = (T + \gamma\lambda)/(1+\lambda)$
is admissible under the squared error loss if

$$\int_{-\infty}^{-c} e^{-\gamma\lambda\theta}\left(\frac{-2\theta}{n}\right)^{-n\lambda/2} d\theta = \int_0^c e^{\gamma\lambda\theta}\theta^{-n\lambda/2}\, d\theta = \infty$$

for some $c > 0$. This means that $T_{\lambda,\gamma}$ is admissible if $\gamma = 0$ and $\lambda = 2/n$, or
if $\gamma > 0$ and $\lambda \ge 2/n$. In particular, $2Y/(n+2)$ is admissible for estimating
$E(T) = 2E(Y)/n = 2\sigma^2$, under the squared error loss. It is easy to see that
$Y/(n+2)$ is then an admissible estimator of $\sigma^2$ under the squared error
loss and the loss $(a - \sigma^2)^2/\sigma^4$. Hence $Y/(n+2)$ is minimax under the loss
$(a - \sigma^2)^2/\sigma^4$.

Note that we cannot apply Corollary 4.3 directly since $\theta_+ = 0$. ∎




Example 4.21. Let $X_1,...,X_n$ be i.i.d. from the Poisson distribution $P(\theta)$
with an unknown $\theta > 0$. The joint p.d.f. of $X_i$'s w.r.t. the counting measure
is $(x_1! \cdots x_n!)^{-1} e^{-n\theta} e^{n\bar{x}\log\theta}$. For $\eta = n\log\theta$, the conditions of Corollary
4.3 are satisfied with $T(X) = \bar{X}$. Since $E(T) = \theta$ and $\mathrm{Var}(T) = \theta/n$,
by Corollary 4.3, $\bar{X}$ is the unique minimax estimator of $\theta$ under the loss
function $(a - \theta)^2/\theta$. ∎


4.3.3 Simultaneous estimation and shrinkage estimators 


In this chapter (and most of Chapter 3) we have focused on the estimation
of a real-valued $\vartheta$. The problem of estimating a vector-valued $\vartheta$ under the
decision theory approach is called simultaneous estimation. Many results
for the case of a real-valued $\vartheta$ can be extended to simultaneous estimation
in a straightforward manner.


Let $\vartheta$ be a $p$-vector of parameters (functions of $\theta$) with range $\tilde\Theta$. A
vector-valued estimator $T(X)$ can be viewed as a decision rule taking values
in the action space $A = \tilde\Theta$. Let $L(\theta,a)$ be a given nonnegative loss function
on $\Theta \times A$. A natural generalization of the squared error loss is

$$L(\theta,a) = \|a - \vartheta\|^2 = \sum_{i=1}^p (a_i - \vartheta_i)^2,$$ (4.37)

where $a_i$ and $\vartheta_i$ are the $i$th components of $a$ and $\vartheta$, respectively.


A vector-valued estimator $T$ is called unbiased if and only if $E(T) = \vartheta$
for all $\theta \in \Theta$. If there is an unbiased estimator of $\vartheta$, then $\vartheta$ is called
estimable. It can be seen that the result in Theorem 3.1 extends to the
case of vector-valued $\vartheta$ with any $L$ strictly convex in $a$. If the loss function
is given by (4.37) and $T_i$ is a UMVUE of $\vartheta_i$ for each $i$, then $T = (T_1,...,T_p)$
is a UMVUE of $\vartheta$. If there is a sufficient and complete statistic $U(X)$ for
$\theta$, then by Theorem 2.5 (Rao-Blackwell theorem), $T$ must be a function of
$U(X)$ and is the unique best unbiased estimator of $\vartheta$.


Example 4.22. Consider the general linear model (3.25) with assumption
A1 and a full rank $Z$. Let $\vartheta = \beta$. An unbiased estimator of $\beta$ is then the
LSE $\hat\beta$. From the proof of Theorem 3.7, $\hat\beta$ is a function of the sufficient and
complete statistic for $\theta = (\beta,\sigma^2)$. Hence, $\hat\beta$ is the unique best unbiased
estimator of $\vartheta$ under any strictly convex loss function. In particular, $\hat\beta$ is
the UMVUE of $\beta$ under the loss function (4.37). ∎


Next, we consider Bayes estimators of $\vartheta$, which are still defined to be
Bayes actions considered as functions of $X$. Under the loss function (4.37),
the Bayes estimator is still given by (4.4) with vector-valued $g(\theta) = \vartheta$.




Example 4.23. Let $X = (X_0,X_1,...,X_k)$ have the multinomial
distribution given in Example 2.7. Consider the estimation of the vector
$\theta = (p_0,p_1,...,p_k)$ under the loss function (4.37), and the Dirichlet prior
for $\theta$ that has the Lebesgue p.d.f.

$$\frac{\Gamma(\alpha_0 + \cdots + \alpha_k)}{\Gamma(\alpha_0)\cdots\Gamma(\alpha_k)}\, p_0^{\alpha_0 - 1} \cdots p_k^{\alpha_k - 1} I_A(\theta),$$ (4.38)

where $\alpha_j$'s are known positive constants and $A = \{\theta : 0 \le p_j, \sum_{j=0}^k p_j = 1\}$.
It turns out that the Dirichlet prior is conjugate so that the posterior of $\theta$
given $X = x$ is also a Dirichlet distribution having the p.d.f. given by (4.38)
with $\alpha_j$ replaced by $\alpha_j + x_j$, $j = 0,1,...,k$. Thus, the Bayes estimator of $\theta$
is $\delta = (\delta_0,\delta_1,...,\delta_k)$ with

$$\delta_j(X) = \frac{\alpha_j + X_j}{\alpha_0 + \alpha_1 + \cdots + \alpha_k + n}.$$


After a suitable class of transformations is defined, the results in §4.2 
for invariant estimators and MRIE’s are still valid. This is illustrated by 
the following example. 


Example 4.24. Let $X$ be a sample with the Lebesgue p.d.f. $f(x - \theta)$,
where $f$ is a known Lebesgue p.d.f. on $\mathcal{R}^p$ with a finite second moment and
$\theta \in \mathcal{R}^p$ is an unknown parameter. Consider the estimation of $\theta$ under the
loss function (4.37). This problem is invariant under the location
transformations $g(X) = X + c$, where $c \in \mathcal{R}^p$. Invariant estimators of $\theta$ are of the
form $X + l$, $l \in \mathcal{R}^p$. It is easy to show that any invariant estimator has
constant bias and risk (a generalization of Proposition 4.4) and the MRIE
of $\theta$ is the unbiased invariant estimator. In particular, if $f$ is the p.d.f. of
$N_p(0,I_p)$, then the MRIE is $X$. ∎


The definition of minimax estimators applies without changes. 


Example 4.25. Let $X$ be a sample from $N_p(\theta,I_p)$ with an unknown
$\theta \in \mathcal{R}^p$. Consider the estimation of $\theta$ under the loss function (4.37). A
modification of the proof of Theorem 4.12 with independent priors for $\theta_j$'s
shows that $X$ is a minimax estimator of $\theta$ (exercise).


Example 4.26. Consider Example 4.23. If we choose $\alpha_0 = \cdots = \alpha_k =
\sqrt{n}/(k+1)$, then the Bayes estimator of $\theta$ in Example 4.23 has constant
risk. Using the same argument in the proof of Theorem 4.11, we can show
that this Bayes estimator is minimax. ∎
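The constant-risk claim in Example 4.26 can be verified with a short computation. The sketch below is illustrative only (function names are hypothetical); it uses the exact risk of the Dirichlet Bayes estimator under the loss (4.37), namely the sum over $j$ of $[np_j(1-p_j) + (\alpha_j - p_j\sum_i \alpha_i)^2]/(n + \sum_i \alpha_i)^2$, which follows from the mean and variance of a multinomial count.

```python
import math


def dirichlet_bayes_estimate(counts, alphas):
    """Posterior mean of the cell probabilities under a Dirichlet prior."""
    n = sum(counts)
    a = sum(alphas)
    return [(aj + xj) / (a + n) for aj, xj in zip(alphas, counts)]


def risk_under_sum_of_squares(probs, n, alphas):
    """Exact risk of the Bayes estimator: variance plus squared bias, cell by cell."""
    a = sum(alphas)
    total = 0.0
    for pj, aj in zip(probs, alphas):
        variance_term = n * pj * (1 - pj)   # Var(X_j) for a multinomial count
        bias_term = (aj - a * pj) ** 2      # squared bias before scaling
        total += variance_term + bias_term
    return total / (a + n) ** 2


if __name__ == "__main__":
    n, cells = 25, 3                        # cells = k + 1
    alphas = [math.sqrt(n) / cells] * cells
    for probs in ([0.2, 0.3, 0.5], [1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]):
        print(probs, risk_under_sum_of_squares(probs, n, alphas))
```

All three parameter values give the same risk, $k/[(k+1)(1+\sqrt{n})^2]$, which is what the minimaxity argument requires.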


The previous results for simultaneous estimation are fairly straightforward
generalizations of those for the case of a real-valued $\vartheta$. Results for
admissibility in simultaneous estimation, however, are quite different. A
surprising result, due to Stein (1956), is that in estimating the vector mean
$\theta = EX$ of a normally distributed $p$-vector $X$ (Example 4.25), $X$ is
inadmissible under the loss function (4.37) when $p \ge 3$, although $X$ is the
UMVUE, MRIE (Example 4.24), and minimax estimator (Example 4.25).
Since any estimator better than a minimax estimator is also minimax, there
exist many (in fact, infinitely many) minimax estimators in Example 4.25
when $p \ge 3$, which is different from the case of $p = 1$ in which $X$ is the
unique admissible minimax estimator (Example 4.6 and Theorem 4.13).


We start with the simple case where $X$ is from $N_p(\theta,I_p)$ with an
unknown $\theta \in \mathcal{R}^p$. James and Stein (1961) proposed the following class of
estimators of $\vartheta = \theta$ having smaller risks than $X$ when the loss is given by
(4.37) and $p \ge 3$:

$$\delta_c = X - \frac{p-2}{\|X - c\|^2}(X - c),$$ (4.39)

where $c \in \mathcal{R}^p$ is fixed. The choice of $c$ is discussed next and at the end of
this section.


Before we prove that $\delta_c$ in (4.39) is better than $X$, we try to motivate
$\delta_c$ from two viewpoints. First, suppose that it were thought a priori likely,
though not certain, that $\theta = c$. Then we might first test a hypothesis
$H_0 : \theta = c$ and estimate $\theta$ by $c$ if $H_0$ is accepted and by $X$ otherwise. The
best rejection region has the form $\|X - c\|^2 > t$ for some constant $t > 0$
(see Chapter 6) so that we might estimate $\theta$ by

$$I_{(t,\infty)}(\|X - c\|^2)X + [1 - I_{(t,\infty)}(\|X - c\|^2)]c.$$

It can be seen that $\delta_c$ in (4.39) is a smoothed version of this estimator,
since

$$\delta_c = \psi(\|X - c\|^2)X + [1 - \psi(\|X - c\|^2)]c$$ (4.40)

for some function $\psi$ (for $\delta_c$ in (4.39), $\psi(t) = 1 - (p-2)/t$). Any estimator
having the form of the right-hand side of (4.40) shrinks the observations
toward a given point $c$ and, therefore, is called a shrinkage estimator.

Next, $\delta_c$ in (4.40) can be viewed as an empirical Bayes estimator (§4.1.2).
In view of (2.25) in Example 2.25, a Bayes estimator of $\theta$ is of the form

$$\delta = (1 - B)X + Bc,$$

where $c$ is the prior mean of $\theta$ and $B$ involves prior variances. If $1 - B$ is
“estimated” by $\psi(\|X - c\|^2)$, then $\delta_c$ is an empirical Bayes estimator.


Theorem 4.15. Suppose that $X$ is from $N_p(\theta,I_p)$ with $p \ge 3$. Then,
under the loss function (4.37), the risks of the following estimators of $\theta$,

$$\delta_{c,r} = X - \frac{r(p-2)}{\|X - c\|^2}(X - c),$$ (4.41)

are given by

$$R_{\delta_{c,r}}(\theta) = p - (2r - r^2)(p-2)^2 E(\|X - c\|^{-2}),$$ (4.42)

where $c \in \mathcal{R}^p$ and $r \in \mathcal{R}$ are known.


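Proof. Let $Z = X - c$. Then

$$R_{\delta_{c,r}}(\theta) = E\|\delta_{c,r} - E(X)\|^2 = E\left\|\left[1 - \frac{r(p-2)}{\|Z\|^2}\right]Z - E(Z)\right\|^2.$$

Hence, we only need to show the case of $c = 0$. Let $h(\theta) = R_{\delta_{0,r}}(\theta)$, $g(\theta)$ be
the right-hand side of (4.42) with $c = 0$, and $\pi_\alpha(\theta) = (2\pi\alpha)^{-p/2}e^{-\|\theta\|^2/(2\alpha)}$,
which is the p.d.f. of $N_p(0,\alpha I_p)$. Note that the distribution of $X$ can be
viewed as the conditional distribution of $X$ given $\boldsymbol{\theta} = \theta$, where the random
vector $\boldsymbol{\theta}$ has the Lebesgue p.d.f. $\pi_\alpha(\theta)$. Then

$$\int_{\mathcal{R}^p} g(\theta)\pi_\alpha(\theta)\,d\theta = p - (2r - r^2)(p-2)^2 E[E(\|X\|^{-2} \mid \boldsymbol{\theta})]$$
$$= p - (2r - r^2)(p-2)^2 E(\|X\|^{-2})$$
$$= p - (2r - r^2)(p-2)/(\alpha + 1),$$

where the expectation in the second line of the previous expression is w.r.t.
the joint distribution of $(X,\boldsymbol{\theta})$ and the last equality follows from the fact
that the marginal distribution of $X$ is $N_p(0,(\alpha+1)I_p)$, $\|X\|^2/(\alpha+1)$ has the
chi-square distribution $\chi_p^2$ and, therefore, $E(\|X\|^{-2}) = 1/[(p-2)(\alpha+1)]$.

Let $B = 1/(\alpha+1)$ and $\hat{B} = r(p-2)/\|X\|^2$. Then

$$\int_{\mathcal{R}^p} h(\theta)\pi_\alpha(\theta)\,d\theta = E\|(1 - \hat{B})X - \boldsymbol{\theta}\|^2$$
$$= E\{E[\|(1 - \hat{B})X - \boldsymbol{\theta}\|^2 \mid X]\}$$
$$= E\{E[\|\boldsymbol{\theta} - E(\boldsymbol{\theta} \mid X)\|^2 \mid X] + \|E(\boldsymbol{\theta} \mid X) - (1 - \hat{B})X\|^2\}$$
$$= E\{p(1 - B) + (\hat{B} - B)^2\|X\|^2\}$$
$$= E\{p(1 - B) + B^2\|X\|^2 - 2Br(p-2) + r^2(p-2)^2\|X\|^{-2}\}$$
$$= p - (2r - r^2)(p-2)B,$$

where the fourth equality follows from the fact that the conditional
distribution of $\boldsymbol{\theta}$ given $X$ is $N_p((1-B)X, (1-B)I_p)$ and the last equality follows
from $E\|X\|^{-2} = B/(p-2)$ and $E\|X\|^2 = p/B$. This proves

$$\int_{\mathcal{R}^p} g(\theta)\pi_\alpha(\theta)\,d\theta = \int_{\mathcal{R}^p} h(\theta)\pi_\alpha(\theta)\,d\theta, \qquad \alpha > 0.$$ (4.43)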




Note that $h(\theta)$ and $g(\theta)$ are expectations of functions of $\|X\|^2$, $\theta^\tau X$,
and $\|\theta\|^2$. Make an orthogonal transformation from $X$ to $Y$ such that
$Y_1 = \theta^\tau X/\|\theta\|$, $EY_j = 0$ for $j > 1$, and $\mathrm{Var}(Y) = I_p$. Then $h(\theta)$ and $g(\theta)$
are expectations of functions of $Y_1$, $\sum_{j=2}^p Y_j^2$, and $\|\theta\|^2$. Thus, both $h$ and
$g$ are functions of $\|\theta\|^2$.

For the family of p.d.f.'s $\{\pi_\alpha(\theta) : \alpha > 0\}$, $\|\theta\|^2$ is a complete and
sufficient “statistic”. Hence, (4.43) and the fact that $h$ and $g$ are functions
of $\|\theta\|^2$ imply that $h(\theta) = g(\theta)$ a.e. w.r.t. the Lebesgue measure. From
Theorem 2.1, both $h$ and $g$ are continuous functions of $\|\theta\|^2$ and, therefore,
$h(\theta) = g(\theta)$ for all $\theta \in \mathcal{R}^p$. This completes the proof. ∎


It follows from Theorem 4.15 that the risk of $\delta_{c,r}$ is smaller than that
of $X$ (for every value of $\theta$) when $p \ge 3$ and $0 < r < 2$, since the risk of $X$ is
$p$ under the loss function (4.37). From Example 4.6, $X$ is admissible when
$p = 1$. When $p = 2$, $X$ is still admissible (Stein, 1956). But we have just
shown that $X$ is inadmissible when $p \ge 3$.


The James-Stein estimator $\delta_c$ in (4.39), which is a special case of (4.41)
with $r = 1$, is better than any $\delta_{c,r}$ in (4.41) with $r \ne 1$, since the factor
$2r - r^2$ takes on its maximum value 1 if and only if $r = 1$. To see that $\delta_c$
may have a substantial improvement over $X$ in terms of risks, consider the
special case where $\theta = c$. Since $\|X - c\|^2$ has the chi-square distribution $\chi_p^2$
when $\theta = c$, $E\|X - c\|^{-2} = (p-2)^{-1}$ and the right-hand side of (4.42) equals
2. Thus, the ratio $R_X(\theta)/R_{\delta_c}(\theta)$ equals $p/2$ when $\theta = c$ and, therefore, can
be substantially larger than 1 near $\theta = c$ when $p$ is large.


Since $X$ is minimax (Example 4.25), any shrinkage estimator of the form
(4.41) is minimax provided that $p \ge 3$ and $0 < r < 2$.
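The risk formula (4.42) can be illustrated by simulation. The following Monte Carlo sketch is not from the original text; the function names, the number of replications, and the seed are arbitrary assumptions. When $\theta = c$ and $r = 1$, the estimated risk of the shrinkage estimator should be close to 2, against $p$ for $X$.

```python
import random


def james_stein(x, c, r=1.0):
    """Shrinkage estimator (4.41): X - r(p-2)/||X-c||^2 * (X-c)."""
    p = len(x)
    z = [xi - ci for xi, ci in zip(x, c)]
    norm2 = sum(zi * zi for zi in z)
    shrink = r * (p - 2) / norm2
    return [xi - shrink * zi for xi, zi in zip(x, z)]


def monte_carlo_risks(theta, c, r=1.0, n_sim=20000, seed=0):
    """Average squared-error loss of X and of the shrinkage estimator."""
    rng = random.Random(seed)
    loss_x = 0.0
    loss_js = 0.0
    for _ in range(n_sim):
        x = [rng.gauss(t, 1.0) for t in theta]      # X ~ N_p(theta, I_p)
        d = james_stein(x, c, r)
        loss_x += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        loss_js += sum((di - t) ** 2 for di, t in zip(d, theta))
    return loss_x / n_sim, loss_js / n_sim


if __name__ == "__main__":
    p = 10
    print(monte_carlo_risks([0.0] * p, [0.0] * p))   # theta equal to c
    print(monte_carlo_risks([3.0] * p, [0.0] * p))   # theta far from c
```

The second call shows the gain shrinking as $\theta$ moves away from $c$, in line with the factor $E(\|X - c\|^{-2})$ in (4.42).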


Unfortunately, the James-Stein estimator with any $c$ is also inadmissible.
It is dominated by

$$\delta_c^+ = X - \min\left\{1, \frac{p-2}{\|X - c\|^2}\right\}(X - c);$$ (4.44)

see, for example, Lehmann (1983, Theorem 4.6.2). This estimator, however,
is still inadmissible. An example of an admissible estimator of the form
(4.40) is provided by Strawderman (1971); see also Lehmann (1983, p.
304). Although neither the James-Stein estimator $\delta_c$ nor $\delta_c^+$ in (4.44) is
admissible, it is found that no substantial improvements over $\delta_c^+$ are possible
(Efron and Morris, 1973).


To extend Theorem 4.15 to general $\mathrm{Var}(X)$, we consider the case where
$\mathrm{Var}(X) = \sigma^2 D$ with an unknown $\sigma^2 > 0$ and a known positive definite
matrix $D$. If $\sigma^2$ is known, then an extended James-Stein estimator is

$$\tilde\delta_{c,r} = X - \frac{r(p-2)\sigma^2}{\|D^{-1}(X - c)\|^2}D^{-1}(X - c).$$ (4.45)




One can show (exercise) that under the loss (4.37), the risk of $\tilde\delta_{c,r}$ is

$$\sigma^2\left[\mathrm{tr}(D) - (2r - r^2)(p-2)^2\sigma^2 E(\|D^{-1}(X - c)\|^{-2})\right].$$ (4.46)


When $\sigma^2$ is unknown, we assume that there exists a statistic $S_0^2$ such
that $S_0^2$ is independent of $X$ and $S_0^2/\sigma^2$ has the chi-square distribution $\chi_m^2$
(see Example 4.27). Replacing $r\sigma^2$ in (4.45) by $\hat\sigma^2 = tS_0^2$ with a constant
$t > 0$ leads to the following extended James-Stein estimator:

$$\tilde\delta_c = X - \frac{(p-2)\hat\sigma^2}{\|D^{-1}(X - c)\|^2}D^{-1}(X - c).$$ (4.47)


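By (4.46) and the independence of $\hat\sigma^2$ and $X$, the risk of $\tilde\delta_c$ (as an estimator
of $\vartheta = EX$) is

$$R_{\tilde\delta_c}(\theta) = E\left[E(\|\tilde\delta_c - \vartheta\|^2 \mid \hat\sigma^2)\right]$$
$$= E\left[E(\|\tilde\delta_{c,(\hat\sigma^2/\sigma^2)} - \vartheta\|^2 \mid \hat\sigma^2)\right]$$
$$= \sigma^2 E\left\{\mathrm{tr}(D) - [2(\hat\sigma^2/\sigma^2) - (\hat\sigma^2/\sigma^2)^2](p-2)^2\sigma^2\kappa(\theta)\right\}$$
$$= \sigma^2\left\{\mathrm{tr}(D) - [2E(\hat\sigma^2/\sigma^2) - E(\hat\sigma^2/\sigma^2)^2](p-2)^2\sigma^2\kappa(\theta)\right\}$$
$$= \sigma^2\left\{\mathrm{tr}(D) - [2tm - t^2 m(m+2)](p-2)^2\sigma^2\kappa(\theta)\right\},$$

where $\theta = (\vartheta,\sigma^2)$ and $\kappa(\theta) = E(\|D^{-1}(X - c)\|^{-2})$. Since $2tm - t^2 m(m+2)$
is maximized at $t = 1/(m+2)$, replacing $t$ by $1/(m+2)$ leads to

$$R_{\tilde\delta_c}(\theta) = \sigma^2\left[\mathrm{tr}(D) - m(m+2)^{-1}(p-2)^2\sigma^2 E(\|D^{-1}(X - c)\|^{-2})\right].$$

Hence, the risk of the extended James-Stein estimator in (4.47) is smaller
than that of $X$ for any fixed $\theta$, when $p \ge 3$.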


Example 4.27. Consider the general linear model (3.25) with assumption
A1, $p \ge 3$, and a full rank $Z$, and the estimation of $\vartheta = \beta$ under the loss
function (4.37). From Theorem 3.8, the LSE $\hat\beta$ is from $N(\beta,\sigma^2 D)$ with a
known matrix $D = (Z^\tau Z)^{-1}$; $S_0^2 = SSR$ is independent of $\hat\beta$; and $S_0^2/\sigma^2$
has the chi-square distribution $\chi_{n-p}^2$. Hence, from the previous discussion,
the risk of the shrinkage estimator

$$\hat\beta - \frac{(p-2)\hat\sigma^2}{\|Z^\tau Z(\hat\beta - c)\|^2}Z^\tau Z(\hat\beta - c)$$

is smaller than that of $\hat\beta$ for any $\beta$ and $\sigma^2$, where $c \in \mathcal{R}^p$ is fixed and
$\hat\sigma^2 = SSR/(n - p + 2)$. ∎


From the previous discussion, the James-Stein estimators improve X 
substantially when we shrink the observations toward a vector c that is near 




$\vartheta = EX$. Of course, this cannot be done since $\vartheta$ is unknown. One may
consider shrinking the observations toward the mean of the observations
rather than a given point; that is, one may obtain a shrinkage estimator by
replacing $c$ in (4.39) or (4.47) by $\bar{X}J_p$, where $\bar{X} = p^{-1}\sum_{i=1}^p X_i$ and $J_p$ is
the $p$-vector of ones. However, we have to replace the factor $p - 2$ in (4.39)
or (4.47) by $p - 3$. This leads to shrinkage estimators

$$X - \frac{p-3}{\|X - \bar{X}J_p\|^2}(X - \bar{X}J_p)$$ (4.48)

and

$$X - \frac{(p-3)\hat\sigma^2}{\|D^{-1}(X - \bar{X}J_p)\|^2}D^{-1}(X - \bar{X}J_p).$$ (4.49)

These estimators are better than $X$ (and, hence, are minimax) when $p \ge 4$,
under the loss function (4.37) (exercise).


The results discussed in this section for the simultaneous estimation 
of a vector of normal means can be extended to a wide variety of cases 
where the loss functions are not given by (4.37) (Brown, 1966). The results 
have also been extended to exponential families and to general location pa- 
rameter families. For example, Berger (1976) studied the inadmissibility 
of generalized Bayes estimators of a location vector; Berger (1980) consid- 
ered simultaneous estimation of gamma scale parameters; and Tsui (1981) 
investigated simultaneous estimation of several Poisson parameters. See 
Lehmann (1983, pp. 320-330) for some further references. 


4.4 The Method of Maximum Likelihood 


So far we have studied estimation methods in parametric families using the
decision theory approach. The maximum likelihood method introduced next
is the most popular method for deriving estimators in statistical inference
that does not use any loss function.


4.4.1 The likelihood function and MLE’s 


To introduce the idea, let us consider an example. 


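Example 4.28. Let $X$ be a single observation taking values from $\{0,1,2\}$
according to $P_\theta$, where $\theta = \theta_0$ or $\theta_1$, and the values of $P_{\theta_j}(\{i\})$ are given by
the following table:

                       x = 0    x = 1    x = 2
    $\theta = \theta_0$       0.8      0.1      0.1
    $\theta = \theta_1$       0.2      0.3      0.5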




If $X = 0$ is observed, it is more plausible that it came from $P_{\theta_0}$, since
$P_{\theta_0}(\{0\})$ is much larger than $P_{\theta_1}(\{0\})$. We then estimate $\theta$ by $\theta_0$. On
the other hand, if $X = 1$ or 2, it is more plausible that it came from $P_{\theta_1}$,
although in this case the difference between the probabilities is not as large
as that in the case of $X = 0$. This suggests the following estimator of $\theta$:

$$T(X) = \begin{cases} \theta_0 & X = 0 \\ \theta_1 & X \ne 0. \end{cases}$$
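The rule in Example 4.28 amounts to picking, for the observed value, the parameter with the larger tabulated probability. A minimal sketch, assuming the table rows correspond to $\theta_0$ and $\theta_1$ as the discussion implies (the dictionary layout and the names `TABLE` and `most_plausible` are illustrative only):

```python
# Probabilities P_theta({x}) from the table in Example 4.28,
# keyed by a label for the parameter value and then by x.
TABLE = {
    "theta0": {0: 0.8, 1: 0.1, 2: 0.1},
    "theta1": {0: 0.2, 1: 0.3, 2: 0.5},
}


def most_plausible(x, table=TABLE):
    """Return the parameter label maximizing P_theta({x}) for the observed x."""
    return max(table, key=lambda label: table[label][x])


if __name__ == "__main__":
    for x in (0, 1, 2):
        print(x, most_plausible(x))
```

The same comparison of finitely many likelihood values works for any finite parameter space.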


The idea in Example 4.28 can be easily extended to the case where $P_\theta$
is a discrete distribution and $\theta \in \Theta \subset \mathcal{R}^k$. If $X = x$ is observed, $\theta_1$ is more
plausible than $\theta_2$ if and only if $P_{\theta_1}(\{x\}) > P_{\theta_2}(\{x\})$. We then estimate
$\theta$ by a $\hat\theta$ that maximizes $P_\theta(\{x\})$ over $\theta \in \Theta$, if such a $\hat\theta$ exists. The
word plausible rather than probable is used because $\theta$ is considered to be
nonrandom and $P_\theta$ is not a distribution of $\theta$. Under the Bayesian approach
with a prior that is the discrete uniform distribution on $\{\theta_1,...,\theta_m\}$, $P_\theta(\{x\})$
is proportional to the posterior probability and we can say that $\theta_1$ is more
probable than $\theta_2$ if $P_{\theta_1}(\{x\}) > P_{\theta_2}(\{x\})$.

Note that $P_\theta(\{x\})$ in the previous discussion is the p.d.f. w.r.t. the
counting measure. Hence, it is natural to extend the idea to the case of
continuous (or arbitrary) $X$ by using the p.d.f. of $X$ w.r.t. some $\sigma$-finite
measure on the range $\mathcal{X}$ of $X$. This leads to the following definition.


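Definition 4.3. Let $X \in \mathcal{X}$ be a sample with a p.d.f. $f_\theta$ w.r.t. a $\sigma$-finite
measure $\nu$, where $\theta \in \Theta \subset \mathcal{R}^k$.

(i) For each $x \in \mathcal{X}$, $f_\theta(x)$ considered as a function of $\theta$ is called the likelihood
function and denoted by $\ell(\theta)$.

(ii) Let $\bar\Theta$ be the closure of $\Theta$. A $\hat\theta \in \bar\Theta$ satisfying $\ell(\hat\theta) = \max_{\theta\in\bar\Theta} \ell(\theta)$ is
called a maximum likelihood estimate (MLE) of $\theta$. If $\hat\theta$ is a Borel function
of $X$ a.e. $\nu$, then $\hat\theta$ is called a maximum likelihood estimator (MLE) of $\theta$.

(iii) Let $g$ be a Borel function from $\Theta$ to $\mathcal{R}^p$, $p \le k$. If $\hat\theta$ is an MLE of $\theta$,
then $\hat\vartheta = g(\hat\theta)$ is defined to be an MLE of $\vartheta = g(\theta)$. ∎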


Note that $\bar\Theta$ instead of $\Theta$ is used in the definition of an MLE. This is
because a maximum of $\ell(\theta)$ may not exist when $\Theta$ is an open set (Examples
4.29 and 4.30). As an estimator, an MLE is defined a.e. $\nu$. Part (iii) of
Definition 4.3 is motivated by a fact given in Exercise 95 of §4.6.

If the parameter space $\Theta$ contains finitely many points, then $\bar\Theta = \Theta$
and an MLE can always be obtained by comparing finitely many values
$\ell(\theta)$, $\theta \in \Theta$. If $\ell(\theta)$ is differentiable on $\Theta^\circ$, the interior of $\Theta$, then possible
candidates for MLE's are the values of $\theta \in \Theta^\circ$ satisfying

$$\frac{\partial \ell(\theta)}{\partial\theta} = 0,$$ (4.50)




which is called the likelihood equation. Note that $\theta$'s satisfying (4.50) may
be local or global minima, local or global maxima, or simply stationary
points. Also, extrema may occur at the boundary of $\Theta$ or when $\|\theta\| \to \infty$.
Furthermore, if $\ell(\theta)$ is not always differentiable, then extrema may occur
at nondifferentiable or discontinuity points of $\ell(\theta)$. Hence, it is important
to analyze the entire likelihood function to find its maxima.

Since log is a strictly increasing function and $\ell(\theta)$ can be assumed
to be positive without loss of generality, $\hat\theta$ is an MLE if and only if it
maximizes the log-likelihood function $\log\ell(\theta)$. It is often more convenient
to work with $\log\ell(\theta)$ and the following analogue of (4.50) (which is called
the log-likelihood equation or likelihood equation for simplicity):

$$\frac{\partial \log\ell(\theta)}{\partial\theta} = 0.$$ (4.51)


Example 4.29. Let $X_1,...,X_n$ be i.i.d. binary random variables with
$P(X_1 = 1) = p \in \Theta = (0,1)$. When $(X_1,...,X_n) = (x_1,...,x_n)$ is observed,
the likelihood function is

$$\ell(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{n\bar{x}}(1-p)^{n(1-\bar{x})},$$

where $\bar{x} = n^{-1}\sum_{i=1}^n x_i$. Note that $\bar\Theta = [0,1]$ and $\Theta^\circ = \Theta$. The likelihood
equation (4.51) reduces to

$$\frac{n\bar{x}}{p} - \frac{n(1-\bar{x})}{1-p} = 0.$$

If $0 < \bar{x} < 1$, then this equation has a unique solution $\bar{x}$. The second-order
derivative of $\log\ell(p)$ is

$$-\frac{n\bar{x}}{p^2} - \frac{n(1-\bar{x})}{(1-p)^2},$$

which is always negative. Also, when $p$ tends to 0 or 1 (the boundary of
$\Theta$), $\ell(p) \to 0$. Thus, $\bar{x}$ is the unique MLE of $p$.

When $\bar{x} = 0$, $\ell(p) = (1-p)^n$ is a strictly decreasing function of $p$ and,
therefore, its unique maximum is attained at $p = 0$. Similarly, the MLE is 1 when $\bar{x} = 1$.
Combining these results with the previous result, we conclude that the MLE
of $p$ is $\bar{x}$.

When $\bar{x} = 0$ or 1, a maximum of $\ell(p)$ does not exist on $\Theta = (0,1)$,
although $\sup_{p\in(0,1)} \ell(p) = 1$; the MLE takes a value outside of $\Theta$ and,
hence, is not a reasonable estimator. However, if $p \in (0,1)$, the probability
that $\bar{x} = 0$ or 1 tends to 0 quickly as $n \to \infty$. ∎




Example 4.29 indicates that, for small $n$, a maximum of $\ell(\theta)$ may not
exist on $\Theta$ and an MLE may be an unreasonable estimator; however, this
is unlikely to occur when $n$ is large. A rigorous result of this sort is given
in §4.5.2, where we study asymptotic properties of MLE's.


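Example 4.30. Let $X_1,...,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with an unknown
$\theta = (\mu,\sigma^2)$, where $n \ge 2$. Consider first the case where $\Theta = \mathcal{R} \times (0,\infty)$.
When $(X_1,...,X_n) = (x_1,...,x_n)$ is observed, the log-likelihood function is

$$\log\ell(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 - \frac{n}{2}\log\sigma^2 - \frac{n}{2}\log(2\pi).$$

The likelihood equation (4.51) becomes

$$\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \quad\text{and}\quad \frac{1}{\sigma^4}\sum_{i=1}^n (x_i - \mu)^2 - \frac{n}{\sigma^2} = 0.$$ (4.52)

Solving the first equation in (4.52) for $\mu$, we obtain a unique solution $\bar{x} =
n^{-1}\sum_{i=1}^n x_i$, and substituting $\bar{x}$ for $\mu$ in the second equation in (4.52),
we obtain a unique solution $\hat\sigma^2 = n^{-1}\sum_{i=1}^n (x_i - \bar{x})^2$. To show that $\hat\theta =
(\bar{x},\hat\sigma^2)$ is an MLE, first note that $\Theta$ is an open set and $\ell(\theta)$ is differentiable
everywhere; as $\theta$ tends to the boundary of $\Theta$ or $\|\theta\| \to \infty$, $\ell(\theta)$ tends to 0;
and

$$\frac{\partial^2 \log\ell(\theta)}{\partial\theta\,\partial\theta^\tau} = -\begin{pmatrix} \dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_{i=1}^n (x_i - \mu) \\ \dfrac{1}{\sigma^4}\sum_{i=1}^n (x_i - \mu) & \dfrac{1}{\sigma^6}\sum_{i=1}^n (x_i - \mu)^2 - \dfrac{n}{2\sigma^4} \end{pmatrix}$$

is negative definite when $\mu = \bar{x}$ and $\sigma^2 = \hat\sigma^2$. Hence $\hat\theta$ is the unique MLE.
Sometimes we can avoid the calculation of the second-order derivatives.
For instance, in this example we know that $\ell(\theta)$ is bounded and $\ell(\theta) \to 0$
as $\|\theta\| \to \infty$ or $\theta$ tends to the boundary of $\Theta$; hence the unique solution
to (4.52) must be the MLE. Another way to show that $\hat\theta$ is the MLE is
indicated by the following discussion.

Consider next the case where $\Theta = (0,\infty) \times (0,\infty)$, i.e., $\mu$ is known
to be positive. The likelihood function is differentiable on $\Theta^\circ = \Theta$ and
$\bar\Theta = [0,\infty) \times [0,\infty)$. If $\bar{x} > 0$, then the same argument for the previous
case can be used to show that $(\bar{x},\hat\sigma^2)$ is the MLE. If $\bar{x} \le 0$, then the first
equation in (4.52) does not have a solution in $\Theta$. However, the function
$\log\ell(\theta) = \log\ell(\mu,\sigma^2)$ is strictly decreasing in $\mu$ for any fixed $\sigma^2$. Hence, a
maximum of $\log\ell(\mu,\sigma^2)$ is $\mu = 0$, which does not depend on $\sigma^2$. Then, the
MLE is $(0,\tilde\sigma^2)$, where $\tilde\sigma^2$ is the value maximizing $\log\ell(0,\sigma^2)$ over $\sigma^2 \ge 0$.
Applying (4.51) to the function $\log\ell(0,\sigma^2)$ leads to $\tilde\sigma^2 = n^{-1}\sum_{i=1}^n x_i^2$.
Thus, the MLE is

$$\hat\theta = \begin{cases} (\bar{x},\hat\sigma^2) & \bar{x} > 0 \\ (0,\tilde\sigma^2) & \bar{x} \le 0. \end{cases}$$

Again, the MLE in this case is not in $\Theta$ if $\bar{x} \le 0$. One can show that a
maximum of $\ell(\theta)$ does not exist on $\Theta$ when $\bar{x} \le 0$. ∎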
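The two cases of Example 4.30 translate directly into a small function. This is a sketch for illustration; the function name and the `mu_positive` flag are assumptions, not notation from the text.

```python
def normal_mle(xs, mu_positive=False):
    """MLE of (mu, sigma^2) for i.i.d. normal data.

    With mu_positive=True the parameter space is (0, inf) x (0, inf) and the
    MLE is taken over its closure, so mu = 0 is returned when the sample
    mean is not positive.
    """
    n = len(xs)
    xbar = sum(xs) / n
    if mu_positive and xbar <= 0:
        # Boundary case: mu = 0 and sigma^2 = n^{-1} * sum of x_i^2.
        return 0.0, sum(x * x for x in xs) / n
    # Interior case: solution of the likelihood equation (4.52).
    return xbar, sum((x - xbar) ** 2 for x in xs) / n


if __name__ == "__main__":
    print(normal_mle([1.0, 2.0, 3.0]))
    print(normal_mle([-1.0, 0.0, -2.0], mu_positive=True))
```

Note that the variance estimate divides by $n$, not $n - 1$, as in the derivation above.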


Example 4.31. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution on an
interval $\mathcal{I}_\theta$ with an unknown $\theta$. First, consider the case where $\mathcal{I}_\theta = (0,\theta)$
and $\theta > 0$. The likelihood function is $\ell(\theta) = \theta^{-n}I_{(x_{(n)},\infty)}(\theta)$, which is
not always differentiable. In this case $\Theta^\circ = (0,x_{(n)}) \cup (x_{(n)},\infty)$. But, on
$(0,x_{(n)})$, $\ell \equiv 0$ and on $(x_{(n)},\infty)$, $\ell'(\theta) = -n\theta^{-n-1} < 0$ for all $\theta$. Hence, the
method of using the likelihood equation is not applicable to this problem.
Since $\ell(\theta)$ is strictly decreasing on $(x_{(n)},\infty)$ and is 0 on $(0,x_{(n)})$, a unique
maximum of $\ell(\theta)$ is $x_{(n)}$, which is a discontinuity point of $\ell(\theta)$. This shows
that the MLE of $\theta$ is the largest order statistic $X_{(n)}$.

Next, consider the case where $\mathcal{I}_\theta = (\theta - \frac{1}{2}, \theta + \frac{1}{2})$ with $\theta \in \mathcal{R}$. The
likelihood function is $\ell(\theta) = I_{(x_{(n)} - \frac{1}{2},\, x_{(1)} + \frac{1}{2})}(\theta)$. Again, the method of
using the likelihood equation is not applicable. However, it follows from
Definition 4.3 that any statistic $T(X)$ satisfying $x_{(n)} - \frac{1}{2} \le T(x) \le x_{(1)} + \frac{1}{2}$
is an MLE of $\theta$. This example indicates that MLE's may not be unique and
can be unreasonable. ∎


Example 4.32. Let $X$ be an observation from the hypergeometric
distribution $HG(r,n,\theta - n)$ (Table 1.1, page 18) with known $r$, $n$, and an
unknown $\theta = n+1, n+2, ....$ In this case, the likelihood function is defined
on integers and the method of using the likelihood equation is certainly not
applicable. Note that

$$\frac{\ell(\theta)}{\ell(\theta - 1)} = \frac{(\theta - r)(\theta - n)}{\theta(\theta - n - r + x)},$$

which is larger than 1 if and only if $\theta < rn/x$ and is smaller than 1 if and
only if $\theta > rn/x$. Thus, $\ell(\theta)$ has a maximum $\theta$ = the integer part of $rn/x$,
which is the MLE of $\theta$. ∎
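The closed form in Example 4.32 can be compared with a direct search over integer values of $\theta$. The sketch below assumes the $HG(r,n,m)$ probability $\binom{n}{x}\binom{m}{r-x}/\binom{n+m}{r}$ with $m = \theta - n$, which is the form consistent with the likelihood ratio displayed above; the function names and the search bound `theta_max` are hypothetical.

```python
from math import comb


def hg_likelihood(theta, x, r, n):
    """Likelihood of theta for one observation x from HG(r, n, theta - n)."""
    return comb(n, x) * comb(theta - n, r - x) / comb(theta, r)


def hg_mle_by_search(x, r, n, theta_max=1000):
    """Maximize the likelihood over integers theta by brute force."""
    # Smallest theta with positive likelihood: need theta - n >= r - x.
    start = max(n + 1, n + r - x)
    return max(range(start, theta_max + 1),
               key=lambda theta: hg_likelihood(theta, x, r, n))


if __name__ == "__main__":
    x, r, n = 3, 10, 7
    print(hg_mle_by_search(x, r, n), (r * n) // x)
```

When $rn/x$ is itself an integer, the ratio equals 1 at $\theta = rn/x$, so $rn/x$ and $rn/x - 1$ both maximize the likelihood; the example above avoids that tie.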


Example 4.33. Let $X_1,...,X_n$ be i.i.d. from the gamma distribution
$\Gamma(\alpha,\gamma)$ with unknown $\alpha > 0$ and $\gamma > 0$. The log-likelihood function is

$$\log\ell(\theta) = -n\alpha\log\gamma - n\log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^n \log x_i - \frac{1}{\gamma}\sum_{i=1}^n x_i$$

and the likelihood equation (4.51) becomes

$$-n\log\gamma - \frac{n\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^n \log x_i = 0$$

and

$$-\frac{n\alpha}{\gamma} + \frac{1}{\gamma^2}\sum_{i=1}^n x_i = 0.$$

The second equation yields $\gamma = \bar{x}/\alpha$. Substituting $\gamma = \bar{x}/\alpha$ into the first
equation we obtain that

$$\log\alpha - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \frac{1}{n}\sum_{i=1}^n \log x_i - \log\bar{x} = 0.$$

In this case, the likelihood equation does not have an explicit solution,
although it can be shown (exercise) that a solution exists almost surely and
it is the unique MLE. A numerical method has to be applied to compute
the MLE for any given observations $x_1,...,x_n$. ∎


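These examples indicate that we need to use various methods to derive
MLE's. In applications, MLE's typically do not have analytic forms and
some numerical methods have to be used to compute MLE's. A commonly
used numerical method is the Newton-Raphson iteration method, which
repeatedly computes

$$\hat\theta^{(t+1)} = \hat\theta^{(t)} - \left[\left.\frac{\partial^2 \log\ell(\theta)}{\partial\theta\,\partial\theta^\tau}\right|_{\theta = \hat\theta^{(t)}}\right]^{-1} \left.\frac{\partial \log\ell(\theta)}{\partial\theta}\right|_{\theta = \hat\theta^{(t)}},$$ (4.53)

$t = 0,1,...$, where $\hat\theta^{(0)}$ is an initial value and $\partial^2\log\ell(\theta)/\partial\theta\,\partial\theta^\tau$ is assumed of
full rank for every $\theta \in \Theta$. If, at each iteration, we replace $\partial^2\log\ell(\theta)/\partial\theta\,\partial\theta^\tau$
in (4.53) by its expected value $E[\partial^2\log\ell(\theta)/\partial\theta\,\partial\theta^\tau]$, where the expectation
is taken under $P_\theta$, then the method is known as the Fisher-scoring method.
If the iteration converges, then $\hat\theta^{(\infty)}$ or $\hat\theta^{(t)}$ with a sufficiently large $t$ is a
numerical approximation to a solution of the likelihood equation (4.51).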
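As an illustration of the Newton-Raphson iteration, the sketch below solves the one-dimensional equation $\log\alpha - \Gamma'(\alpha)/\Gamma(\alpha) + n^{-1}\sum\log x_i - \log\bar{x} = 0$ from Example 4.33 and then sets $\gamma = \bar{x}/\alpha$. It is a simplified variant that iterates on the profile equation rather than on the full two-parameter system in (4.53). Several pieces are assumptions made for the sketch and do not come from the text: the helper functions `digamma` and `trigamma` (standard recurrence plus asymptotic series), the starting value, and the stopping rule.

```python
import math
import random


def digamma(x):
    """Gamma'(x)/Gamma(x) for x > 0, via recurrence and an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def trigamma(x):
    """Derivative of digamma for x > 0, via recurrence and an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))


def gamma_mle(xs, tol=1e-10, max_iter=100):
    """Newton-Raphson for the gamma shape alpha; returns (alpha, gamma) with gamma = xbar/alpha."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.log(xbar) - sum(math.log(x) for x in xs) / n   # positive unless all x_i are equal
    if s <= 0:
        raise ValueError("need at least two distinct positive observations")
    alpha = 0.5 / s          # assumed starting value; it lies below the root
    for _ in range(max_iter):
        f = math.log(alpha) - digamma(alpha) - s
        f_prime = 1.0 / alpha - trigamma(alpha)
        step = f / f_prime
        alpha -= step
        if abs(step) < tol * alpha:
            break
    return alpha, xbar / alpha


if __name__ == "__main__":
    random.seed(1)
    data = [random.gammavariate(3.0, 2.0) for _ in range(500)]   # shape 3, scale 2
    print(gamma_mle(data))
```

Because the function being solved is decreasing and convex in $\alpha$, starting below the root makes the iterates increase toward it, which is why no step-size safeguard is included here.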


The following example shows that the MCMC methods discussed in 
§4.1.4 can also be useful in computing MLE’s. 


Example 4.34. Let $X$ be a random $k$-vector from $P_\theta$ with the following
p.d.f. w.r.t. a $\sigma$-finite measure $\nu$:

$$f_\theta(x) = \int f_\theta(x,y)\,d\nu(y),$$

where $f_\theta(x,y)$ is a joint p.d.f. w.r.t. $\nu \times \nu$. This type of distribution is
called a mixture distribution. Thus, the likelihood $\ell(\theta) = f_\theta(x)$ involves a
$k$-dimensional integral. In many cases this integral has to be computed in
order to compute an MLE of $\theta$.

Let $\ell_m(\theta)$ be the MCMC approximation to $\ell(\theta)$ based on one of the
MCMC methods described in §4.1.4 and a Markov chain of length $m$. Under
the conditions of Theorem 4.4, $\ell_m(\theta) \to_{a.s.} \ell(\theta)$ for every fixed $\theta$ and $x$.
Suppose that, for each $m$, there exists $\hat\theta_m$ that maximizes $\ell_m(\theta)$ over $\theta \in \Theta$.
Geyer (1994) studies the convergence of $\hat\theta_m$ to an MLE. ∎




In terms of their mse’s, MLE’s are not necessarily better than UMVUE’s 
or Bayes estimators. Also, MLE’s are frequently inadmissible. This is 
not surprising, since MLE’s are not derived under any given loss function. 
The main theoretical justification for MLE’s is provided in the theory of 
asymptotic efficiency considered in §4.5.


4.4.2 MLE’s in generalized linear models 


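Suppose that $X$ has a distribution from a natural exponential family so
that the likelihood function is

\[
\ell(\eta) = \exp\{\eta^\tau T(x) - \zeta(\eta)\}h(x),
\]

where $\eta \in \Xi$ is a vector of unknown parameters. The likelihood equation
(4.51) is then

\[
\frac{\partial\log\ell(\eta)}{\partial\eta} = T(x) - \frac{\partial\zeta(\eta)}{\partial\eta} = 0,
\]

which has a unique solution $T(x) = \partial\zeta(\eta)/\partial\eta$, assuming that $T(x)$ is in the
range of $\partial\zeta(\eta)/\partial\eta$. Note that

\[
\frac{\partial^2\log\ell(\eta)}{\partial\eta\,\partial\eta^\tau} = -\frac{\partial^2\zeta(\eta)}{\partial\eta\,\partial\eta^\tau} = -\mathrm{Var}(T) \tag{4.54}
\]

(see the proof of Proposition 3.2). Since $\mathrm{Var}(T)$ is positive definite,
$-\log\ell(\eta)$ is convex in $\eta$ and $T(x)$ is the unique MLE of the parameter
$\mu(\eta) = \partial\zeta(\eta)/\partial\eta$. By (4.54) again, the function $\mu(\eta)$ is one-to-one so that
$\mu^{-1}$ exists. By Definition 4.3, the MLE of $\eta$ is $\hat\eta = \mu^{-1}(T(x))$.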

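If the distribution of $X$ is in a general exponential family and the
likelihood function is

\[
\ell(\theta) = \exp\{[\eta(\theta)]^\tau T(x) - \xi(\theta)\}h(x),
\]

then the MLE of $\theta$ is $\hat\theta = \eta^{-1}(\hat\eta)$, if $\eta^{-1}$ exists and $\hat\eta$ is in the range of $\eta(\theta)$.
Of course, $\hat\theta$ is also the solution of the likelihood equation

\[
\frac{\partial\log\ell(\theta)}{\partial\theta} = \frac{\partial\eta(\theta)}{\partial\theta}T(x) - \frac{\partial\xi(\theta)}{\partial\theta} = 0.
\]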


The results for exponential families lead to an estimation method in a 
class of models that have very wide applications. These models are gener- 
alizations of the normal linear model (model (3.25) with assumption A1) 
discussed in §3.3.1-§3.3.2 and, therefore, are named generalized linear mod- 
els (GLM). 




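A GLM has the following structure. The sample $X = (X_1,\ldots,X_n) \in \mathcal{R}^n$
has independent components and $X_i$ has the p.d.f.

\[
\exp\left\{\frac{\eta_i x_i - \zeta(\eta_i)}{\phi_i}\right\}h(x_i,\phi_i), \qquad i = 1,\ldots,n, \tag{4.55}
\]

w.r.t. a $\sigma$-finite measure $\nu$, where $\eta_i$ and $\phi_i$ are unknown, $\phi_i > 0$,

\[
\eta_i \in \Xi = \left\{\eta : 0 < \int h(x,\phi)e^{\eta x/\phi}\,d\nu(x) < \infty\right\} \subset \mathcal{R}
\]

for all $i$, $\zeta$ and $h$ are known functions, and $\zeta''(\eta) > 0$ is assumed for all
$\eta \in \Xi^\circ$, the interior of $\Xi$. Note that the p.d.f. in (4.55) belongs to an
exponential family if $\phi_i$ is known. As a consequence,

\[
E(X_i) = \zeta'(\eta_i) \quad\text{and}\quad \mathrm{Var}(X_i) = \phi_i\zeta''(\eta_i), \qquad i = 1,\ldots,n. \tag{4.56}
\]

Define $\mu(\eta) = \zeta'(\eta)$. It is assumed that $\eta_i$ is related to $Z_i$, the $i$th value of
a $p$-vector of covariates (see (3.24)), through

\[
g(\mu(\eta_i)) = \beta^\tau Z_i, \qquad i = 1,\ldots,n, \tag{4.57}
\]

where $\beta$ is a $p$-vector of unknown parameters and $g$, called a link function,
is a known one-to-one, third-order continuously differentiable function on
$\{\mu(\eta) : \eta \in \Xi^\circ\}$. If $\mu = g^{-1}$, then $\eta_i = \beta^\tau Z_i$ and $g$ is called the canonical or
natural link function. If $g$ is not canonical, we assume that $\frac{d}{d\eta}(g\circ\mu)(\eta) \neq 0$
for all $\eta$.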


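In a GLM, the parameter of interest is $\beta$. We assume that the range
of $\beta$ is $B = \{\beta : (g\circ\mu)^{-1}(\beta^\tau z) \in \Xi^\circ \text{ for all } z \in \mathcal{Z}\}$, where $\mathcal{Z}$ is the
range of $Z_i$'s. $\phi_i$'s are called dispersion parameters and are considered to
be nuisance parameters. It is often assumed that

\[
\phi_i = \phi/t_i, \qquad i = 1,\ldots,n, \tag{4.58}
\]

with an unknown $\phi > 0$ and known positive $t_i$'s.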


As we discussed earlier, the linear model (3.24) with $\varepsilon_i = N(0,\phi_i)$ is a
special GLM. One can verify this by taking $g(\mu) = \mu$ and $\zeta(\eta) = \eta^2/2$. The
usefulness of the GLM is that it covers situations where the relationship
between $E(X_i)$ and $Z_i$ is nonlinear and/or $X_i$'s are discrete (in which case
the linear model (3.24) is clearly not appropriate). The following is an
example.


Example 4.35. Let $X_i$'s be independent discrete random variables taking
values in $\{0,1,\ldots,m\}$, where $m$ is a known positive integer. First, suppose
that $X_i$ has the binomial distribution $Bi(p_i,m)$ with an unknown $p_i \in
(0,1)$, $i = 1,\ldots,n$. Let $\eta_i = \log\frac{p_i}{1-p_i}$ and $\zeta(\eta_i) = m\log(1+e^{\eta_i})$. Then the
p.d.f. of $X_i$ (w.r.t. the counting measure) is given by (4.55) with $\phi_i = 1$
and $h(x_i,\phi_i) = \binom{m}{x_i}$. Under (4.57) with the canonical (logit) link
$g(t) = \log\frac{t}{m-t}$,

\[
E(X_i) = mp_i = \frac{me^{\eta_i}}{1+e^{\eta_i}} = \frac{me^{\beta^\tau Z_i}}{1+e^{\beta^\tau Z_i}}.
\]

Another popular link in this problem is the probit link $g(t) = \Phi^{-1}(t/m)$,
where $\Phi$ is the c.d.f. of the standard normal. Under the probit link, $E(X_i) =
m\Phi(\beta^\tau Z_i)$.

The variance of $X_i$ is $mp_i(1-p_i)$ under the binomial distribution
assumption. This assumption is often violated in applications, which results
in an over-dispersion, i.e., the variance of $X_i$ exceeds the nominal
variance $mp_i(1-p_i)$. Over-dispersion can arise in a number of ways, but the
most common one is clustering in the population. Families, households,
and litters are common instances of clustering. For example, suppose that
$X_i = \sum_{j=1}^m X_{ij}$, where $X_{ij}$ are binary random variables having a common
distribution. If $X_{ij}$'s are independent, then $X_i$ has a binomial distribution.
However, if $X_{ij}$'s are from the same cluster (family or household), then
they are often positively correlated. Suppose that the correlation
coefficient (§1.3.2) between $X_{ij}$ and $X_{il}$, $j \neq l$, is $\rho_i > 0$. Then

\[
\mathrm{Var}(X_i) = mp_i(1-p_i) + m(m-1)\rho_i p_i(1-p_i) = \phi_i mp_i(1-p_i),
\]

where $\phi_i = 1 + (m-1)\rho_i$ is the dispersion parameter. Of course,
over-dispersion can occur only if $m > 1$ in this case.
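The variance formula follows from expanding the variance of a sum. The intermediate step, spelled out here for convenience, uses only the facts that a binary variable with mean $p_i$ has variance $p_i(1-p_i)$ and that there are $m(m-1)$ ordered pairs $(j,l)$ with $j \neq l$:

```latex
\begin{align*}
\mathrm{Var}(X_i)
  &= \sum_{j=1}^m \mathrm{Var}(X_{ij}) + \sum_{j \neq l} \mathrm{Cov}(X_{ij}, X_{il}) \\
  &= m p_i(1-p_i) + m(m-1)\,\rho_i\, p_i(1-p_i) \\
  &= [1 + (m-1)\rho_i]\, m p_i(1-p_i).
\end{align*}
```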

This motivates the consideration of GLM (4.55)-(4.57) with dispersion
parameters $\phi_i$. If $X_i$ has the p.d.f. (4.55) with $\zeta(\eta_i) = m\log(1+e^{\eta_i})$, then

\[
E(X_i) = \frac{me^{\eta_i}}{1+e^{\eta_i}} \quad\text{and}\quad \mathrm{Var}(X_i) = \phi_i\frac{me^{\eta_i}}{(1+e^{\eta_i})^2},
\]

which is exactly (4.56). Of course, the distribution of $X_i$ is not binomial
unless $\phi_i = 1$. ∎


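We now derive an MLE of $\beta$ in a GLM under assumption (4.58). Let
$\theta = (\beta,\phi)$ and $\psi = (g\circ\mu)^{-1}$. Then the log-likelihood function is

\[
\log\ell(\theta) = \sum_{i=1}^n \left[\log h(x_i,\phi/t_i) + \frac{\psi(\beta^\tau Z_i)x_i - \zeta(\psi(\beta^\tau Z_i))}{\phi/t_i}\right]
\]

and the likelihood equation is

\[
\frac{\partial\log\ell(\theta)}{\partial\beta} = \frac{1}{\phi}\sum_{i=1}^n \left\{[x_i - \mu(\psi(\beta^\tau Z_i))]\psi'(\beta^\tau Z_i)t_i Z_i\right\} = 0 \tag{4.59}
\]

and

\[
\frac{\partial\log\ell(\theta)}{\partial\phi} = \sum_{i=1}^n \left\{\frac{\partial\log h(x_i,\phi/t_i)}{\partial\phi} - \frac{t_i[\psi(\beta^\tau Z_i)x_i - \zeta(\psi(\beta^\tau Z_i))]}{\phi^2}\right\} = 0.
\]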


From the first equation, an MLE of $\beta$, if it exists, can be obtained without
estimating $\phi$. The second equation, however, is usually difficult to solve.
Some other estimators of $\phi$ are suggested by various researchers; see, for
example, McCullagh and Nelder (1989).

Suppose that there is a solution $\hat\beta \in B$ to equation (4.59). (The
existence of $\hat\beta$ is studied in §4.5.2.) We now study whether $\hat\beta$ is an MLE of $\beta$.
Let

\[
M_n(\beta) = \sum_{i=1}^n [\psi'(\beta^\tau Z_i)]^2\zeta''(\psi(\beta^\tau Z_i))t_i Z_i Z_i^\tau \tag{4.60}
\]

and

\[
R_n(\beta) = \sum_{i=1}^n [x_i - \mu(\psi(\beta^\tau Z_i))]\psi''(\beta^\tau Z_i)t_i Z_i Z_i^\tau. \tag{4.61}
\]

Then

\[
\mathrm{Var}\left(\frac{\partial\log\ell(\theta)}{\partial\beta}\right) = M_n(\beta)/\phi \tag{4.62}
\]

and

\[
\frac{\partial^2\log\ell(\theta)}{\partial\beta\,\partial\beta^\tau} = [R_n(\beta) - M_n(\beta)]/\phi. \tag{4.63}
\]

Consider first the simple case of canonical $g$. Then $\psi'' \equiv 0$ and $R_n \equiv 0$.
If $M_n(\beta)$ is positive definite for all $\beta$, then $-\log\ell(\theta)$ is strictly convex in
$\beta$ for any fixed $\phi$ and, therefore, $\hat\beta$ is the unique MLE of $\beta$. For the case
of noncanonical $g$, $R_n(\beta) \neq 0$ and $\hat\beta$ is not necessarily an MLE. If $R_n(\beta)$
is dominated by $M_n(\beta)$ (i.e., $[M_n(\beta)]^{-1/2}R_n(\beta)[M_n(\beta)]^{-1/2} \to 0$ in some
sense), then $-\log\ell(\theta)$ is convex and $\hat\beta$ is an MLE for large $n$; see more
details in the proof of Theorem 4.18 in §4.5.2.


Example 4.36. Consider the GLM (4.55) with $\zeta(\eta) = \eta^2/2$, $\eta \in \mathcal{R}$. If $g$
in (4.57) is the canonical link, then the model is the same as (3.24) with
independent $\varepsilon_i$'s distributed as $N(0,\phi_i)$. If (4.58) holds with $t_i \equiv 1$, then
(4.59) is exactly the same as equation (3.27). If $Z$ is of full rank, then
$M_n(\beta) = Z^\tau Z$ is positive definite. Thus, we have shown that the LSE $\hat\beta$
given by (3.28) is actually the unique MLE of $\beta$.

Suppose now that $g$ is noncanonical but (4.58) still holds with $t_i \equiv 1$.
Then the model reduces to the one with independent $X_i$'s and

\[
X_i = N\left(g^{-1}(\beta^\tau Z_i), \phi\right), \qquad i = 1,\ldots,n. \tag{4.64}
\]

This type of model is called a nonlinear regression model (with normal
errors) and an MLE of $\beta$ under this model is also called a nonlinear LSE,
since maximizing the log-likelihood is equivalent to minimizing the sum of
squares $\sum_{i=1}^n [X_i - g^{-1}(\beta^\tau Z_i)]^2$. Under certain conditions the matrix $R_n(\beta)$
is dominated by $M_n(\beta)$ and an MLE of $\beta$ exists. More details can be found
in §4.5.2. ∎


Example 4.37 (The Poisson model). Consider the GLM (4.55) with $\zeta(\eta) =
e^\eta$, $\eta \in \mathcal{R}$. If $\phi_i = 1$, then $X_i$ has the Poisson distribution with mean $e^{\eta_i}$.
Assume that (4.58) holds. Under the canonical link $g(t) = \log t$,

\[
M_n(\beta) = \sum_{i=1}^n e^{\beta^\tau Z_i}t_i Z_i Z_i^\tau,
\]

which is positive definite if $\inf_i e^{\beta^\tau Z_i} > 0$ and the matrix $(\sqrt{t_1}Z_1,\ldots,\sqrt{t_n}Z_n)$
is of full rank.

There is one noncanonical link that deserves attention. Suppose that
we choose a link function so that $[\psi'(t)]^2\zeta''(\psi(t)) \equiv 1$. Then $M_n(\beta) \equiv
\sum_{i=1}^n t_i Z_i Z_i^\tau$ does not depend on $\beta$. In §4.5.2 it is shown that the
asymptotic variance of the MLE $\hat\beta$ is $\phi[M_n(\beta)]^{-1}$. The fact that $M_n(\beta)$ does not
depend on $\beta$ makes the estimation of the asymptotic variance (and, thus,
statistical inference) easy. Under the Poisson model, $\zeta''(t) = e^t$ and,
therefore, we need to solve the differential equation $[\psi'(t)]^2 e^{\psi(t)} = 1$. A solution
is $\psi(t) = 2\log(t/2)$, which gives the link function $g(\mu) = 2\sqrt{\mu}$.
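The solution of the differential equation and the resulting link can be checked in a few lines. The derivation below is a sketch that takes the positive square root and sets the integration constant to zero, which is one admissible choice and assumes $t > 0$:

```latex
\begin{align*}
[\psi'(t)]^2 e^{\psi(t)} = 1
  &\;\Longleftarrow\; \psi'(t)\, e^{\psi(t)/2} = 1
   \;\Longleftrightarrow\; \frac{d}{dt}\left[ 2 e^{\psi(t)/2} \right] = 1 \\
  &\;\Longleftarrow\; 2 e^{\psi(t)/2} = t
   \;\Longleftrightarrow\; \psi(t) = 2\log(t/2), \\
\mu = \zeta'(\psi(t)) = e^{\psi(t)} = t^2/4
  &\;\Longrightarrow\; t = 2\sqrt{\mu}, \quad\text{i.e.,}\quad g(\mu) = 2\sqrt{\mu}.
\end{align*}
```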


In a GLM, an MLE $\hat\beta$ usually does not have an analytic form. A
numerical method such as the Newton-Raphson or the Fisher-scoring method has
to be applied. Using the Newton-Raphson method, we have the following
iteration procedure:

\[
\hat\beta^{(t+1)} = \hat\beta^{(t)} - [R_n(\hat\beta^{(t)}) - M_n(\hat\beta^{(t)})]^{-1}s_n(\hat\beta^{(t)}), \qquad t = 0,1,\ldots,
\]

where $s_n(\beta) = \phi\,\partial\log\ell(\theta)/\partial\beta$. Note that $E[R_n(\beta)] = 0$ if $\beta$ is the true
parameter value and $x_i$ is replaced by $X_i$. This means that the
Fisher-scoring method uses the following iteration procedure:

\[
\hat\beta^{(t+1)} = \hat\beta^{(t)} + [M_n(\hat\beta^{(t)})]^{-1}s_n(\hat\beta^{(t)}), \qquad t = 0,1,\ldots.
\]

If the canonical link is used, then the two methods are identical.
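The Fisher-scoring iteration can be sketched for the Poisson model of Example 4.37 with the canonical link and $t_i \equiv 1$, in which case $s_n(\beta) = \sum_i (x_i - e^{\beta^\tau Z_i})Z_i$ and $M_n(\beta) = \sum_i e^{\beta^\tau Z_i}Z_iZ_i^\tau$. The code below assumes $Z_i = (1, z_i)$ so that the $2 \times 2$ system can be solved explicitly; the function name, the starting value, and the data are hypothetical.

```python
import math

def fisher_scoring_poisson(z, x, n_iter=50, tol=1e-12):
    """Fit E(X_i) = exp(b0 + b1 * z_i) by the Fisher-scoring iteration."""
    b0, b1 = math.log(sum(x) / len(x)), 0.0  # start from the constant-mean fit
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * zi) for zi in z]
        # Score s_n(beta) = sum (x_i - mu_i) Z_i with Z_i = (1, z_i).
        s0 = sum(xi - mi for xi, mi in zip(x, mu))
        s1 = sum((xi - mi) * zi for xi, mi, zi in zip(x, mu, z))
        # M_n(beta) = sum mu_i Z_i Z_i^T, a symmetric 2 x 2 matrix.
        m00 = sum(mu)
        m01 = sum(mi * zi for mi, zi in zip(mu, z))
        m11 = sum(mi * zi * zi for mi, zi in zip(mu, z))
        det = m00 * m11 - m01 * m01
        # Step = M_n^{-1} s_n, using the explicit inverse of a 2 x 2 matrix.
        d0 = (m11 * s0 - m01 * s1) / det
        d1 = (m00 * s1 - m01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

z = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical covariate values
x = [1, 2, 4, 7, 12]           # hypothetical counts
b0, b1 = fisher_scoring_poisson(z, x)
```

Because the link is canonical here, the same code is also the Newton-Raphson iteration. The dispersion parameter $\phi$ cancels from the update, which is why it does not appear.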


4.4.3 Quasi-likelihoods and conditional likelihoods 


We now introduce two variations of the method of using likelihoods. 




Consider a GLM (4.55)-(4.57). Assumption (4.58) is often unrealistic in
applications. If there is no restriction on $\phi_i$'s, however, there are too many
parameters and an MLE of $\beta$ may not exist. (Note that assumption (4.58)
reduces $n$ nuisance parameters to one.) One way to solve this problem
is to assume that $\phi_i = h(Z_i,\xi)$ for some known function $h$ and unknown
parameter vector $\xi$ (which may include $\beta$ as a subvector). Let $\theta = (\beta,\xi)$.
Then we can try to solve the likelihood equation $\partial\log\ell(\theta)/\partial\theta = 0$ to obtain
an MLE of $\beta$ and/or $\xi$. We omit the details, which can be found, for
example, in Smyth (1989).

Suppose that we do not impose any assumptions on $\phi_i$'s but still
estimate $\beta$ by solving

\[
\tilde{s}_n(\beta) = \sum_{i=1}^n \left\{[x_i - \mu(\psi(\beta^\tau Z_i))]\psi'(\beta^\tau Z_i)t_i Z_i\right\} = 0. \tag{4.65}
\]

Note that (4.65) is not a likelihood equation unless (4.58) holds. In the
special case of Example 4.36 where $X_i = N(\beta^\tau Z_i,\phi_i)$, $i = 1,\ldots,n$, a solution
to (4.65) is simply an LSE of $\beta$ whose properties are discussed at the end
of §3.3.3. Estimating $\beta$ by solving equation (4.65) is motivated by the
following facts. First, if (4.58) does hold, then our estimate is an MLE.
Second, if (4.58) is slightly violated, the performance of our estimate is
still nearly the same as that of an MLE under assumption (4.58) (see the
discussion of robustness at the end of §3.3.3). Finally, estimators obtained
by solving (4.65) usually have good asymptotic properties. As a special
case of a general result in §5.4, a solution to (4.65) is asymptotically normal
under some regularity conditions.


In general, an equation such as (4.65) is called a quasi-likelihood equation
if and only if it is a likelihood equation when certain assumptions hold. The
"likelihood" corresponding to a quasi-likelihood equation is called
quasi-likelihood and a maximum of the quasi-likelihood is then called a maximum
quasi-likelihood estimate (MQLE). Thus, a solution to (4.65) is an MQLE.

Note that (4.65) is a likelihood equation if and only if both (4.55) and
(4.58) hold. The LSE (§3.3) without normality assumption on $X_i$'s is a
simple example of an MQLE without (4.55). Without assumption (4.55),
the model under consideration is usually nonparametric and, therefore, the
MQLE's are studied in §5.4.


While the quasi-likelihoods are used to relax some assumptions in our
models, the conditional likelihoods discussed next are used mainly in cases
where MLE's are difficult to compute. We consider two cases. In the first
case, $\theta = (\theta_1,\theta_2)$, $\theta_1$ is the main parameter vector of interest, and $\theta_2$ is a
nuisance parameter vector. Suppose that there is a statistic $T_2(X)$ that is
sufficient for $\theta_2$ for each fixed $\theta_1$. By the sufficiency, the conditional
distribution of $X$ given $T_2$ does not depend on $\theta_2$. The likelihood function
corresponding to the conditional p.d.f. of $X$ given $T_2$ is called the
conditional likelihood function. A conditional MLE of $\theta_1$ can then be obtained
by maximizing the conditional likelihood function. This method can be
applied to the case where the dimension of $\theta$ is considerably larger than
the dimension of $\theta_1$ so that computing the unconditional MLE of $\theta$ is much
more difficult than computing the conditional MLE of $\theta_1$. Note that the
conditional MLE's are usually different from the unconditional MLE's.

As a more specific example, suppose that $X$ has a p.d.f. in an exponential
family:

\[
f_\theta(x) = \exp\{\theta_1^\tau T_1(x) + \theta_2^\tau T_2(x) - \zeta(\theta)\}h(x).
\]

Then $T_2$ is sufficient for $\theta_2$ for any given $\theta_1$. Problems of this type are
from comparisons of two binomial distributions or two Poisson distributions
(Exercises 119-120).


The second case is when our sample $X = (X_1,\ldots,X_n)$ follows a
first-order autoregressive time series model:

\[
X_t - \mu = \rho(X_{t-1} - \mu) + \varepsilon_t, \qquad t = 2,\ldots,n,
\]

where $\mu \in \mathcal{R}$ and $\rho \in (-1,1)$ are unknown and $\varepsilon_t$'s are i.i.d. from $N(0,\sigma^2)$
with an unknown $\sigma^2 > 0$. This model is often a satisfactory representation
of the error time series in economic models, and is one of the simplest
and most heavily used models in time series analysis (Fuller, 1996). Let
$\theta = (\mu,\rho,\sigma^2)$. The log-likelihood function is

\[
\log\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 + \frac{1}{2}\log(1-\rho^2)
 - \frac{1}{2\sigma^2}\left\{(x_1-\mu)^2(1-\rho^2) + \sum_{t=2}^n [x_t - \mu - \rho(x_{t-1}-\mu)]^2\right\}.
\]

The computation of the MLE is greatly simplified if we consider the
conditional likelihood given $X_1 = x_1$:

\[
\log\ell(\theta|x_1) = -\frac{n-1}{2}\log(2\pi) - \frac{n-1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^n [x_t - \mu - \rho(x_{t-1}-\mu)]^2.
\]

Let $(\bar{x}_{-1},\bar{x}_0) = (n-1)^{-1}\left(\sum_{t=2}^n x_{t-1}, \sum_{t=2}^n x_t\right)$. If

\[
\hat\rho = \sum_{t=2}^n (x_t - \bar{x}_0)(x_{t-1} - \bar{x}_{-1}) \bigg/ \sum_{t=2}^n (x_{t-1} - \bar{x}_{-1})^2
\]

is between $-1$ and $1$, then it is the conditional MLE of $\rho$ and the conditional
MLE's of $\mu$ and $\sigma^2$ are, respectively,

\[
\hat\mu = (\bar{x}_0 - \hat\rho\bar{x}_{-1})/(1-\hat\rho)
\]

and

\[
\hat\sigma^2 = \frac{1}{n-1}\sum_{t=2}^n [x_t - \bar{x}_0 - \hat\rho(x_{t-1} - \bar{x}_{-1})]^2.
\]
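The closed-form conditional MLE's translate directly into code. The following sketch implements the three formulas above; the function name and the test series are hypothetical, and the series is deliberately noise-free (with $\mu = 2$ and $\rho = 0.5$) so that the estimates can be checked by hand.

```python
def ar1_conditional_mle(x):
    """Conditional MLE of (mu, rho, sigma^2) in the AR(1) model, given X_1 = x_1."""
    n = len(x)
    lag, cur = x[:-1], x[1:]  # x_{t-1} and x_t for t = 2, ..., n
    xbar_lag = sum(lag) / (n - 1)
    xbar_cur = sum(cur) / (n - 1)
    sxy = sum((c - xbar_cur) * (l - xbar_lag) for c, l in zip(cur, lag))
    sxx = sum((l - xbar_lag) ** 2 for l in lag)
    rho = sxy / sxx
    if not -1.0 < rho < 1.0:
        raise ValueError("ratio is outside (-1, 1); the closed form does not apply")
    mu = (xbar_cur - rho * xbar_lag) / (1.0 - rho)
    resid = [c - xbar_cur - rho * (l - xbar_lag) for c, l in zip(cur, lag)]
    sigma2 = sum(r * r for r in resid) / (n - 1)
    return mu, rho, sigma2

# Noise-free series satisfying x_t - 2 = 0.5 * (x_{t-1} - 2).
series = [3.0, 2.5, 2.25, 2.125, 2.0625]
mu_hat, rho_hat, sigma2_hat = ar1_conditional_mle(series)
```

The computation is an ordinary least squares regression of $x_t$ on $x_{t-1}$, which is why conditioning on $X_1$ removes the need for numerical optimization.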


Obviously, the result can be extended to the case where $X$ follows a
$p$th-order autoregressive time series model:

\[
X_t - \mu = \rho_1(X_{t-1} - \mu) + \cdots + \rho_p(X_{t-p} - \mu) + \varepsilon_t, \qquad t = p+1,\ldots,n, \tag{4.66}
\]

where $\rho_j$'s are unknown parameters satisfying the constraint that the roots
(which may be complex) of the polynomial $x^p - \rho_1x^{p-1} - \cdots - \rho_p = 0$ are
less than one in absolute value (exercise).

Some other likelihood based methods are introduced in §5.1.4. Although
they can also be applied to parametric models, the methods in §5.1.4 are
more useful in nonparametric models.


4.5 Asymptotically Efficient Estimation 


In this section, we consider asymptotic optimality of point estimators in 
parametric models. We use the asymptotic mean squared error (amse, 
see §2.5.2) or its multivariate generalization to assess the performance of 
an estimator. Reasons for considering asymptotics have been discussed in 
§2.5. 

We focus on estimators that are asymptotically normal, since this covers 


the majority of cases. Some cases of asymptotically nonnormal estimators 
are studied in Exercises 111-114 in §4.6. 


4.5.1 Asymptotic optimality 


Let $\{\hat\theta_n\}$ be a sequence of estimators of $\theta$ based on a sequence of samples
$\{X = (X_1,\ldots,X_n) : n = 1,2,\ldots\}$ whose distributions are in a parametric
family indexed by $\theta$. Suppose that as $n \to \infty$,

\[
[V_n(\theta)]^{-1/2}(\hat\theta_n - \theta) \to_d N_k(0,I_k), \tag{4.67}
\]

where, for each $n$, $V_n(\theta)$ is a $k \times k$ positive definite matrix depending on
$\theta$. If $\theta$ is one-dimensional ($k = 1$), then $V_n(\theta)$ is the asymptotic variance as
well as the amse of $\hat\theta_n$ (§2.5.2). When $k > 1$, $V_n(\theta)$ is called the asymptotic
covariance matrix of $\hat\theta_n$ and can be used as a measure of asymptotic
performance of estimators. If $\hat\theta_{jn}$ satisfies (4.67) with asymptotic covariance
matrix $V_{jn}(\theta)$, $j = 1,2$, and $V_{1n}(\theta) \leq V_{2n}(\theta)$ (in the sense that $V_{2n}(\theta) - V_{1n}(\theta)$
is nonnegative definite) for all $\theta \in \Theta$, then $\hat\theta_{1n}$ is said to be
asymptotically more efficient than $\hat\theta_{2n}$. Of course, some sequences of estimators are
not comparable under this criterion. Also, since the asymptotic covariance
matrices are unique only in the limiting sense, we have to make our
comparison based on their limits. When $X_i$'s are i.i.d., $V_n(\theta)$ is usually of the
form $n^{-\delta}V(\theta)$ for some $\delta > 0$ ($= 1$ in the majority of cases) and a positive
definite matrix $V(\theta)$ that does not depend on $n$.


Note that (4.67) implies that $\hat\theta_n$ is an asymptotically unbiased estimator
of $\theta$. If $V_n(\theta) = \mathrm{Var}(\hat\theta_n)$, then, under some regularity conditions, it follows
from Theorem 3.3 that

\[
V_n(\theta) \geq [I_n(\theta)]^{-1}, \tag{4.68}
\]

where, for every $n$, $I_n(\theta)$ is the Fisher information matrix (see (3.5)) for $X$ of
size $n$. (Note that (4.68) holds if and only if $l^\tau V_n(\theta)l \geq l^\tau[I_n(\theta)]^{-1}l$ for every
$l \in \mathcal{R}^k$.) Unfortunately, when $V_n(\theta)$ is an asymptotic covariance matrix,
(4.68) may not hold (even in the limiting sense), even if the regularity
conditions in Theorem 3.3 are satisfied.


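Example 4.38 (Hodges). Let $X_1,\ldots,X_n$ be i.i.d. from $N(\theta,1)$, $\theta \in \mathcal{R}$.
Then $I_n(\theta) = n$. Define

\[
\hat\theta_n = \begin{cases} \bar{X} & |\bar{X}| \geq n^{-1/4} \\ t\bar{X} & |\bar{X}| < n^{-1/4}, \end{cases}
\]

where $t$ is a fixed constant. By Proposition 3.2, all conditions in Theorem
3.3 are satisfied. It can be shown (exercise) that (4.67) holds with $V_n(\theta) =
V(\theta)/n$, where $V(\theta) = 1$ if $\theta \neq 0$ and $V(\theta) = t^2$ if $\theta = 0$. If $t^2 < 1$, (4.68)
does not hold when $\theta = 0$. ∎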
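The superefficiency at $\theta = 0$ can be seen in a small simulation. The sketch below is illustrative only: it draws $\bar{X}$ directly from $N(\theta, 1/n)$ instead of simulating the full sample, and the function names, sample size, number of replications, and seed are arbitrary choices.

```python
import random

def hodges(xbar, n, t):
    """Hodges estimator computed from the sample mean xbar of n observations."""
    return xbar if abs(xbar) >= n ** -0.25 else t * xbar

def scaled_mse(theta, n, t, reps=2000, seed=0):
    """Monte Carlo estimate of n * E(theta_hat - theta)^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, n ** -0.5)  # the sample mean is N(theta, 1/n)
        total += (hodges(xbar, n, t) - theta) ** 2
    return n * total / reps

at_zero = scaled_mse(0.0, n=400, t=0.5)  # compare with V(0) = t^2 = 0.25
at_one = scaled_mse(1.0, n=400, t=0.5)   # compare with V(1) = 1
```

With these settings the scaled mean squared error at $\theta = 0$ should fall well below the bound $n[I_n(\theta)]^{-1} = 1$, while at $\theta = 1$ it should stay close to 1, up to Monte Carlo error.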


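However, the following result, due to Le Cam (1953), shows that (4.68)
holds for i.i.d. $X_i$'s except for $\theta$ in a set of Lebesgue measure 0.

Theorem 4.16. Let $X_1,\ldots,X_n$ be i.i.d. from a p.d.f. $f_\theta$ w.r.t. a $\sigma$-finite
measure $\nu$ on $(\mathcal{R},\mathcal{B})$, where $\theta \in \Theta$ and $\Theta$ is an open set in $\mathcal{R}^k$. Suppose that
for every $x$ in the range of $X_1$, $f_\theta(x)$ is twice continuously differentiable in
$\theta$ and satisfies

\[
\frac{\partial}{\partial\theta}\int \psi_\theta(x)\,d\nu = \int \frac{\partial}{\partial\theta}\psi_\theta(x)\,d\nu
\]

for $\psi_\theta(x) = f_\theta(x)$ and $= \partial f_\theta(x)/\partial\theta$; the Fisher information matrix

\[
I_1(\theta) = E\left\{\frac{\partial}{\partial\theta}\log f_\theta(X_1)\left[\frac{\partial}{\partial\theta}\log f_\theta(X_1)\right]^\tau\right\}
\]

is positive definite; and for any given $\theta \in \Theta$, there exists a positive number
$c_\theta$ and a positive function $h_\theta$ such that $E[h_\theta(X_1)] < \infty$ and

\[
\sup_{\gamma:\|\gamma-\theta\|<c_\theta}\left\|\frac{\partial^2\log f_\gamma(x)}{\partial\gamma\,\partial\gamma^\tau}\right\| \leq h_\theta(x) \tag{4.69}
\]

for all $x$ in the range of $X_1$, where $\|A\| = \sqrt{\mathrm{tr}(A^\tau A)}$ for any matrix $A$.
If $\hat\theta_n$ is an estimator of $\theta$ (based on $X_1,\ldots,X_n$) and satisfies (4.67) with
$V_n(\theta) = V(\theta)/n$, then there is a $\Theta_0 \subset \Theta$ with Lebesgue measure 0 such
that (4.68) holds if $\theta \notin \Theta_0$.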

Proof. We adopt the proof given by Bahadur (1964) and prove the case
of univariate $\theta$. The proof for multivariate $\theta$ is similar and can be found in
Bahadur (1964). Let $x = (x_1,\ldots,x_n)$, $\theta_n = \theta + n^{-1/2} \in \Theta$, and

\[
K_n(x,\theta) = [\log\ell(\theta_n) - \log\ell(\theta) + I_1(\theta)/2]/[I_1(\theta)]^{1/2}.
\]

Under the assumed conditions, it can be shown (exercise) that

\[
K_n(X,\theta) \to_d N(0,1). \tag{4.70}
\]

Let $P_{\theta_n}$ (or $P_\theta$) be the distribution of $X$ under the assumption that $X_1$
has the p.d.f. $f_{\theta_n}$ (or $f_\theta$). Define $g_n(\theta) = |P_\theta(\hat\theta_n \leq \theta) - \frac{1}{2}|$. Let $\Phi$ denote
the standard normal c.d.f. or its probability measure. By the dominated
convergence theorem (Theorem 1.1(iii)), as $n \to \infty$,

\[
\int g_n(\theta_n)\,d\Phi(\theta) = \int g_n(\theta)e^{n^{-1/2}\theta - (2n)^{-1}}\,d\Phi(\theta) \to 0,
\]

since $g_n(\theta) \to 0$ under (4.67). By Theorem 1.8(ii) and (vi), there exists a
sequence $\{n_k\}$ such that $g_{n_k}(\theta_{n_k}) \to_{a.s.} 0$ w.r.t. $\Phi$. Since $\Phi$ is equivalent to
the Lebesgue measure, we conclude that there is a $\Theta_0 \subset \Theta$ with Lebesgue
measure 0 such that

\[
\lim_{k\to\infty} g_{n_k}(\theta_{n_k}) = 0, \qquad \theta \notin \Theta_0. \tag{4.71}
\]

Assume that $\theta \notin \Theta_0$. Then, for any $t > [I_1(\theta)]^{1/2}$,

\begin{align*}
P_{\theta_n}(K_n(X,\theta) \leq t)
 &= \int_{K_n(x,\theta)\leq t} \ell(\theta_n)\,d\nu\times\cdots\times d\nu \\
 &= \int_{K_n(x,\theta)\leq t} \frac{\ell(\theta_n)}{\ell(\theta)}\,dP_\theta(x) \\
 &= e^{-I_1(\theta)/2}\int_{K_n(x,\theta)\leq t} e^{[I_1(\theta)]^{1/2}K_n(x,\theta)}\,dP_\theta(x) \\
 &= e^{-I_1(\theta)/2}\int_{-\infty}^t e^{[I_1(\theta)]^{1/2}z}\,dH_n(z) \\
 &= e^{-I_1(\theta)/2}\int_{-\infty}^t e^{[I_1(\theta)]^{1/2}z}\,d\Phi(z) + o(1) \\
 &= \Phi\left(t - [I_1(\theta)]^{1/2}\right) + o(1),
\end{align*}

where $H_n$ denotes the distribution of $K_n(X,\theta)$ and the next to last equality
follows from (4.70) and the dominated convergence theorem. This result
and result (4.71) imply that there is a sequence $\{n_j\}$ such that for $j =
1,2,\ldots$,

\[
P_{\theta_{n_j}}(\hat\theta_{n_j} \leq \theta_{n_j}) < P_{\theta_{n_j}}(K_{n_j}(X,\theta) \leq t). \tag{4.72}
\]

By the Neyman-Pearson lemma (Theorem 6.1 in §6.1.1), we conclude that
(4.72) implies that for $j = 1,2,\ldots$,

\[
P_\theta(\hat\theta_{n_j} \leq \theta_{n_j}) < P_\theta(K_{n_j}(X,\theta) \leq t). \tag{4.73}
\]

(The reader should come back to this after reading §6.1.1.) From (4.70)
and (4.67) with $V_n(\theta) = V(\theta)/n$, (4.73) implies

\[
\Phi\left([V(\theta)]^{-1/2}\right) \leq \Phi(t).
\]

Hence $[V(\theta)]^{-1/2} \leq t$. Since $I_n(\theta) = nI_1(\theta)$ (Proposition 3.1(i)) and $t$ is
arbitrary but $> [I_1(\theta)]^{1/2}$, we conclude that (4.68) holds.


Points at which (4.68) does not hold are called points of superefficiency.
Motivated by the fact that the set of superefficiency points is of Lebesgue
measure 0 under some regularity conditions, we have the following
definition.

Definition 4.4. Assume that the Fisher information matrix $I_n(\theta)$ is well
defined and positive definite for every $n$. A sequence of estimators $\{\hat\theta_n\}$
satisfying (4.67) is said to be asymptotically efficient or asymptotically optimal
if and only if $V_n(\theta) = [I_n(\theta)]^{-1}$. ∎


Suppose that we are interested in estimating $\vartheta = g(\theta)$, where $g$ is a
differentiable function from $\Theta$ to $\mathcal{R}^p$, $1 \leq p \leq k$. If $\hat\theta_n$ satisfies (4.67),
then, by Theorem 1.12(i), $\hat\vartheta_n = g(\hat\theta_n)$ is asymptotically distributed as
$N_p(\vartheta, [\nabla g(\theta)]^\tau V_n(\theta)\nabla g(\theta))$. Thus, inequality (4.68) becomes

\[
[\nabla g(\theta)]^\tau V_n(\theta)\nabla g(\theta) \geq [\tilde{I}_n(\vartheta)]^{-1},
\]

where $\tilde{I}_n(\vartheta)$ is the Fisher information matrix about $\vartheta$ contained in $X$. If
$p = k$ and $g$ is one-to-one, then

\[
[\tilde{I}_n(\vartheta)]^{-1} = [\nabla g(\theta)]^\tau [I_n(\theta)]^{-1}\nabla g(\theta)
\]

and, therefore, $\hat\vartheta_n$ is asymptotically efficient if and only if $\hat\theta_n$ is
asymptotically efficient. For this reason, in the case of $p < k$, $\hat\vartheta_n$ is considered to be
asymptotically efficient if and only if $\hat\theta_n$ is asymptotically efficient, and we
can focus on the estimation of $\theta$ only.




4.5.2 Asymptotic efficiency of MLE’s and RLE’s 


We now show that under some regularity conditions, a root of the likeli- 
hood equation (RLE), which is a candidate for an MLE, is asymptotically 
efficient. 


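Theorem 4.17. Assume the conditions of Theorem 4.16.
(i) There is a sequence of estimators $\{\hat\theta_n\}$ such that

\[
P(s_n(\hat\theta_n) = 0) \to 1 \quad\text{and}\quad \hat\theta_n \to_p \theta, \tag{4.74}
\]

where $s_n(\gamma) = \partial\log\ell(\gamma)/\partial\gamma$.
(ii) Any consistent sequence $\tilde\theta_n$ of RLE's is asymptotically efficient.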
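Proof. (i) Let $B_n(c) = \{\gamma : \|[I_n(\theta)]^{1/2}(\gamma - \theta)\| \leq c\}$ for $c > 0$. Since $\Theta$
is open, for each $c > 0$, $B_n(c) \subset \Theta$ for sufficiently large $n$. Since $B_n(c)$
shrinks to $\{\theta\}$ as $n \to \infty$, the existence of $\hat\theta_n$ satisfying (4.74) is implied by
the fact that for any $\epsilon > 0$, there exists $c > 0$ and $n_0 > 1$ such that

\[
P\left(\log\ell(\gamma) - \log\ell(\theta) < 0 \text{ for all } \gamma \in \partial B_n(c)\right) \geq 1 - \epsilon, \qquad n \geq n_0, \tag{4.75}
\]

where $\partial B_n(c)$ is the boundary of $B_n(c)$. (For a proof of the measurability of
$\hat\theta_n$, see Serfling (1980, pp. 147-148).) For $\gamma \in \partial B_n(c)$, the Taylor expansion
gives

\begin{align*}
\log\ell(\gamma) - \log\ell(\theta) &= c\lambda^\tau [I_n(\theta)]^{-1/2}s_n(\theta) \tag{4.76} \\
 &\quad + (c^2/2)\lambda^\tau [I_n(\theta)]^{-1/2}\nabla s_n(\gamma^*)[I_n(\theta)]^{-1/2}\lambda,
\end{align*}

where $\lambda = [I_n(\theta)]^{1/2}(\gamma - \theta)/c$ satisfying $\|\lambda\| = 1$, $\nabla s_n(\gamma) = \partial s_n(\gamma)/\partial\gamma$,
and $\gamma^*$ lies between $\gamma$ and $\theta$. Note that

\begin{align*}
E\,\frac{\|\nabla s_n(\gamma^*) - \nabla s_n(\theta)\|}{n}
 &\leq E\max_{\gamma\in B_n(c)}\frac{\|\nabla s_n(\gamma) - \nabla s_n(\theta)\|}{n} \\
 &\leq E\max_{\gamma\in B_n(c)}\left\|\frac{\partial^2\log f_\gamma(X_1)}{\partial\gamma\,\partial\gamma^\tau} - \frac{\partial^2\log f_\theta(X_1)}{\partial\theta\,\partial\theta^\tau}\right\| \\
 &\to 0, \tag{4.77}
\end{align*}

which follows from (a) $\partial^2\log f_\gamma(x)/\partial\gamma\partial\gamma^\tau$ is continuous in a neighborhood
of $\theta$ for any fixed $x$; (b) $B_n(c)$ shrinks to $\{\theta\}$; and (c) for sufficiently large
$n$,

\[
\max_{\gamma\in B_n(c)}\left\|\frac{\partial^2\log f_\gamma(X_1)}{\partial\gamma\,\partial\gamma^\tau} - \frac{\partial^2\log f_\theta(X_1)}{\partial\theta\,\partial\theta^\tau}\right\| \leq 2h_\theta(X_1)
\]

under condition (4.69). By the SLLN (Theorem 1.13) and Proposition 3.1,
$n^{-1}\nabla s_n(\theta) \to_{a.s.} -I_1(\theta)$ (i.e., $\|n^{-1}\nabla s_n(\theta) + I_1(\theta)\| \to_{a.s.} 0$). These results,
together with (4.76), imply that

\[
\log\ell(\gamma) - \log\ell(\theta) = c\lambda^\tau [I_n(\theta)]^{-1/2}s_n(\theta) - [1 + o_p(1)]c^2/2. \tag{4.78}
\]

Note that $\max_\lambda\{\lambda^\tau [I_n(\theta)]^{-1/2}s_n(\theta)\} = \|[I_n(\theta)]^{-1/2}s_n(\theta)\|$. Hence, (4.75)
follows from (4.78) and

\begin{align*}
P\left(\|[I_n(\theta)]^{-1/2}s_n(\theta)\| < c/4\right) &\geq 1 - (4/c)^2 E\|[I_n(\theta)]^{-1/2}s_n(\theta)\|^2 \\
 &= 1 - k(4/c)^2 \\
 &\geq 1 - \epsilon
\end{align*}

by choosing $c$ sufficiently large. This completes the proof of (i).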

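(ii) Let $A_\epsilon = \{\gamma : \|\gamma - \theta\| \leq \epsilon\}$ for $\epsilon > 0$. Since $\Theta$ is open, $A_\epsilon \subset \Theta$
for sufficiently small $\epsilon$. Let $\{\tilde\theta_n\}$ be a sequence of consistent RLE's, i.e.,
$P(s_n(\tilde\theta_n) = 0 \text{ and } \tilde\theta_n \in A_\epsilon) \to 1$ for any $\epsilon > 0$. Hence, we can focus on the
set on which $s_n(\tilde\theta_n) = 0$ and $\tilde\theta_n \in A_\epsilon$. Using the mean-value theorem for
vector-valued functions, we obtain that

\[
-s_n(\theta) = \left[\int_0^1 \nabla s_n\left(\theta + t(\tilde\theta_n - \theta)\right)dt\right](\tilde\theta_n - \theta).
\]

Note that

\[
\frac{1}{n}\left\|\int_0^1 \nabla s_n\left(\theta + t(\tilde\theta_n - \theta)\right)dt - \nabla s_n(\theta)\right\| \leq \max_{\gamma\in A_\epsilon}\frac{\|\nabla s_n(\gamma) - \nabla s_n(\theta)\|}{n}.
\]

Using the argument in proving (4.77) and the fact that $P(\tilde\theta_n \in A_\epsilon) \to 1$
for arbitrary $\epsilon > 0$, we obtain that

\[
\frac{1}{n}\left\|\int_0^1 \nabla s_n\left(\theta + t(\tilde\theta_n - \theta)\right)dt - \nabla s_n(\theta)\right\| \to_p 0.
\]

Since $n^{-1}\nabla s_n(\theta) \to_{a.s.} -I_1(\theta)$ and $I_n(\theta) = nI_1(\theta)$,

\[
-s_n(\theta) = -I_n(\theta)(\tilde\theta_n - \theta) + o_p\left(\|I_n(\theta)(\tilde\theta_n - \theta)\|\right).
\]

This and Slutsky's theorem (Theorem 1.11) imply that $\sqrt{n}(\tilde\theta_n - \theta)$ has the
same asymptotic distribution as

\[
\sqrt{n}[I_n(\theta)]^{-1}s_n(\theta) = n^{-1/2}[I_1(\theta)]^{-1}s_n(\theta) \to_d N_k\left(0, [I_1(\theta)]^{-1}\right)
\]

by the CLT (Corollary 1.2), since $\mathrm{Var}(s_n(\theta)) = I_n(\theta)$.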


Theorem 4.17(i) shows the asymptotic existence of a sequence of con- 
sistent RLE’s, and Theorem 4.17(ii) shows the asymptotic efficiency of any 
sequence of consistent RLE’s. However, for a given sequence of RLE’s, its 
consistency has to be checked unless the RLE’s are unique for sufficiently 
large n, in which case the consistency of the RLE’s is guaranteed by The- 
orem 4.17(i). 




RLE's are not necessarily MLE's. We still have to use the techniques
discussed in §4.4 to check whether an RLE is an MLE. However, according
to Theorem 4.17, when a sequence of RLE's is consistent, then it is
asymptotically efficient and, therefore, we may not need to search for MLE's, if
asymptotic efficiency is the only criterion to select estimators. The method
of estimating $\theta$ by solving $s_n(\gamma) = 0$ over $\gamma \in \Theta$ is called scoring and the
function $s_n(\gamma)$ is called the score function.


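Example 4.39. Suppose that $X_i$ has a distribution in a natural
exponential family, i.e., the p.d.f. of $X_i$ is

\[
f_\eta(x_i) = \exp\{\eta^\tau T(x_i) - \zeta(\eta)\}h(x_i). \tag{4.79}
\]

Since $\partial^2\log f_\eta(x_i)/\partial\eta\partial\eta^\tau = -\partial^2\zeta(\eta)/\partial\eta\partial\eta^\tau$, condition (4.69) is satisfied.
From Proposition 3.2, other conditions in Theorem 4.16 are also satisfied.
For i.i.d. $X_i$'s,

\[
s_n(\eta) = \sum_{i=1}^n \left[T(X_i) - \frac{\partial\zeta(\eta)}{\partial\eta}\right].
\]

If $\hat\theta_n = n^{-1}\sum_{i=1}^n T(X_i) \in \Theta$, the range of $\theta = g(\eta) = \partial\zeta(\eta)/\partial\eta$, then $\hat\theta_n$ is
a unique RLE of $\theta$, which is also a unique MLE of $\theta$ since $\partial^2\zeta(\eta)/\partial\eta\partial\eta^\tau =
\mathrm{Var}(T(X_i))$ is positive definite. Also, $\eta = g^{-1}(\theta)$ exists and a unique RLE
(MLE) of $\eta$ is $\hat\eta_n = g^{-1}(\hat\theta_n)$.

However, $\hat\theta_n$ may not be in $\Theta$ and the previous argument fails (e.g.,
Example 4.29). What Theorem 4.17 tells us in this case is that as $n \to \infty$,
$P(\hat\theta_n \in \Theta) \to 1$ and, therefore, $\hat\theta_n$ (or $\hat\eta_n$) is the unique asymptotically
efficient RLE (MLE) of $\theta$ (or $\eta$) in the limiting sense.

In an example like this we can directly show that $P(\hat\theta_n \in \Theta) \to 1$, using
the fact that $\hat\theta_n \to_{a.s.} E[T(X_1)] = g(\eta)$ (the SLLN). ∎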


The next theorem provides a similar result for the MLE or RLE in the
GLM (§4.4.2).


Theorem 4.18. Consider the GLM (4.55)-(4.58) with $t_i$'s in a fixed
interval $(t_0,t_\infty)$, $0 < t_0 \leq t_\infty < \infty$. Assume that the range of the unknown
parameter $\beta$ in (4.57) is an open subset of $\mathcal{R}^p$; at the true parameter value
$\beta$, $0 < \inf_i \varphi(\beta^\tau Z_i) \leq \sup_i \varphi(\beta^\tau Z_i) < \infty$, where $\varphi(t) = [\psi'(t)]^2\zeta''(\psi(t))$;
as $n \to \infty$, $\max_{i\leq n} Z_i^\tau(Z^\tau Z)^{-1}Z_i \to 0$ and $\lambda_-[Z^\tau Z] \to \infty$, where $Z$ is
the $n \times p$ matrix whose $i$th row is the vector $Z_i$ and $\lambda_-[A]$ is the smallest
eigenvalue of the matrix $A$.

(i) There is a unique sequence of estimators $\{\hat\beta_n\}$ such that

\[
P(s_n(\hat\beta_n) = 0) \to 1 \quad\text{and}\quad \hat\beta_n \to_p \beta, \tag{4.80}
\]

where $s_n(\gamma)$ is the score function defined to be the left-hand side of (4.59)
with $\gamma = \beta$.
(ii) Let $I_n(\beta) = \mathrm{Var}(s_n(\beta))$. Then

\[
[I_n(\beta)]^{1/2}(\hat\beta_n - \beta) \to_d N_p(0,I_p). \tag{4.81}
\]

(iii) If $\phi$ in (4.58) is known or the p.d.f. in (4.55) indexed by $\theta = (\beta,\phi)$
satisfies the conditions for $f_\theta$ in Theorem 4.16, then $\hat\beta_n$ is asymptotically
efficient.

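Proof. (i) The proof of the existence of $\hat\beta_n$ satisfying (4.80) is the same as
that of Theorem 4.17(i) with $\theta = \beta$, except that we need to show

\[
\max_{\gamma\in B_n(c)}\left\|[I_n(\beta)]^{-1/2}\nabla s_n(\gamma)[I_n(\beta)]^{-1/2} + I_p\right\| \to_p 0,
\]

where $B_n(c) = \{\gamma : \|[I_n(\beta)]^{1/2}(\gamma - \beta)\| \leq c\}$. From (4.62) and (4.63),
$I_n(\beta) = M_n(\beta)/\phi$ and $\nabla s_n(\gamma) = [R_n(\gamma) - M_n(\gamma)]/\phi$, where $M_n(\gamma)$ and
$R_n(\gamma)$ are defined by (4.60)-(4.61) with $\gamma = \beta$. Hence, it suffices to show
that for any $c > 0$,

\[
\max_{\gamma\in B_n(c)}\left\|[M_n(\beta)]^{-1/2}[M_n(\gamma) - M_n(\beta)][M_n(\beta)]^{-1/2}\right\| \to 0 \tag{4.82}
\]

and

\[
\max_{\gamma\in B_n(c)}\left\|[M_n(\beta)]^{-1/2}R_n(\gamma)[M_n(\beta)]^{-1/2}\right\| \to_p 0. \tag{4.83}
\]

The left-hand side of (4.82) is bounded by

\[
\sqrt{p}\max_{\gamma\in B_n(c),\,i\leq n}\left|1 - \varphi(\gamma^\tau Z_i)/\varphi(\beta^\tau Z_i)\right|,
\]

which converges to 0 since $\varphi$ is continuous and, for $\gamma \in B_n(c)$,

\begin{align*}
|\gamma^\tau Z_i - \beta^\tau Z_i|^2 &= \left|(\gamma - \beta)^\tau [I_n(\beta)]^{1/2}[I_n(\beta)]^{-1/2}Z_i\right|^2 \\
 &\leq \|[I_n(\beta)]^{1/2}(\gamma - \beta)\|^2\,\|[I_n(\beta)]^{-1/2}Z_i\|^2 \\
 &\leq c^2\max_{i\leq n} Z_i^\tau [I_n(\beta)]^{-1}Z_i \\
 &\leq c^2\phi\left[t_0\inf_i \varphi(\beta^\tau Z_i)\right]^{-1}\max_{i\leq n} Z_i^\tau (Z^\tau Z)^{-1}Z_i \\
 &\to 0
\end{align*}

under the assumed conditions. This proves (4.82).
Let $e_i = X_i - \mu(\psi(\beta^\tau Z_i))$,

\[
U_n(\gamma) = \sum_{i=1}^n [\mu(\psi(\beta^\tau Z_i)) - \mu(\psi(\gamma^\tau Z_i))]\psi''(\gamma^\tau Z_i)t_i Z_i Z_i^\tau,
\]

\[
V_n(\gamma) = \sum_{i=1}^n e_i[\psi''(\gamma^\tau Z_i) - \psi''(\beta^\tau Z_i)]t_i Z_i Z_i^\tau,
\]

and

\[
W_n(\beta) = \sum_{i=1}^n e_i\psi''(\beta^\tau Z_i)t_i Z_i Z_i^\tau.
\]

Then $R_n(\gamma) = U_n(\gamma) + V_n(\gamma) + W_n(\beta)$. Using the same argument as that
in proving (4.82), we can show that

\[
\max_{\gamma\in B_n(c)}\left\|[M_n(\beta)]^{-1/2}U_n(\gamma)[M_n(\beta)]^{-1/2}\right\| \to 0.
\]

Note that $\|[M_n(\beta)]^{-1/2}V_n(\gamma)[M_n(\beta)]^{-1/2}\|$ is bounded by the product of

\[
\left\|[M_n(\beta)]^{-1/2}\sum_{i=1}^n |e_i|t_i Z_i Z_i^\tau [M_n(\beta)]^{-1/2}\right\| = O_p(1)
\]

and

\[
\max_{\gamma\in B_n(c),\,i\leq n}\left|\psi''(\gamma^\tau Z_i) - \psi''(\beta^\tau Z_i)\right|,
\]

which can be shown to be $o(1)$ using the same argument as that in proving
(4.82). Hence,

\[
\max_{\gamma\in B_n(c)}\left\|[M_n(\beta)]^{-1/2}V_n(\gamma)[M_n(\beta)]^{-1/2}\right\| \to_p 0
\]

and (4.83) follows from

\[
\left\|[M_n(\beta)]^{-1/2}W_n(\beta)[M_n(\beta)]^{-1/2}\right\| \to_p 0.
\]

To show this result, we apply Theorem 1.14(ii). Since $E(e_i) = 0$ and $e_i$'s
are independent, it suffices to show that

\[
\sum_{i=1}^n E\left|e_i\psi''(\beta^\tau Z_i)Z_i^\tau [M_n(\beta)]^{-1}Z_i\right|^{1+\delta} \to 0 \tag{4.84}
\]

for some $\delta \in (0,1)$. Note that $\sup_i E|e_i|^{1+\delta} < \infty$. Hence, there is a constant
$C > 0$ such that the left-hand side of (4.84) is bounded by

\[
C\sum_{i=1}^n \left|Z_i^\tau (Z^\tau Z)^{-1}Z_i\right|^{1+\delta} \leq pC\max_{i\leq n}\left|Z_i^\tau (Z^\tau Z)^{-1}Z_i\right|^\delta \to 0.
\]

Hence, (4.84) follows from Theorem 1.14(ii). This proves (4.80). The
uniqueness of $\hat\beta_n$ follows from (4.83) and the fact that $M_n(\gamma)$ is positive
definite in a neighborhood of $\beta$. This completes the proof of (i).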




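(ii) The proof of (ii) is very similar to that of Theorem 4.17(ii). Using the
results in the proof of (i) and Taylor's expansion, we can establish (exercise)
that

\[
[I_n(\beta)]^{1/2}(\hat\beta_n - \beta) = [I_n(\beta)]^{-1/2}s_n(\beta) + o_p(1). \tag{4.85}
\]

Using the CLT (e.g., Corollary 1.3) and Theorem 1.9(iii), we can show
(exercise) that

\[
[I_n(\beta)]^{-1/2}s_n(\beta) \to_d N_p(0,I_p). \tag{4.86}
\]

Result (4.81) follows from (4.85)-(4.86) and Slutsky's theorem.
(iii) The result is obvious if $\phi$ is known. When $\phi$ is unknown, it follows
from (4.59) that

\[
\frac{\partial}{\partial\phi}\left[\frac{\partial\log\ell(\theta)}{\partial\beta}\right] = -\frac{s_n(\beta)}{\phi}.
\]

Since $E[s_n(\beta)] = 0$, the Fisher information about $\theta = (\beta,\phi)$ is

\[
I_n(\beta,\phi) = -E\left[\frac{\partial^2\log\ell(\theta)}{\partial\theta\,\partial\theta^\tau}\right] = \begin{pmatrix} I_n(\beta) & 0 \\ 0 & \tilde{I}_n(\phi) \end{pmatrix},
\]

where $\tilde{I}_n(\phi)$ is the Fisher information about $\phi$. The result then follows
from (4.81) and the discussion in the end of §4.5.1. ∎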


4.5.3 Other asymptotically efficient estimators 


To study other asymptotically efficient estimators, we start with MRIE’s in 
location-scale families. Since MLE’s and RLE’s are invariant (see Exercise 
109 in §4.6), MRIE’s are often asymptotically efficient; see, for example, 
Stone (1974). 

Assume the conditions in Theorem 4.16 and let $s_n(\gamma)$ be the score function.
Let $\hat\theta_n^{(0)}$ be an estimator of $\theta$ that may not be asymptotically efficient.
The estimator
\[ \hat\theta_n^{(1)} = \hat\theta_n^{(0)} - [\nabla s_n(\hat\theta_n^{(0)})]^{-1} s_n(\hat\theta_n^{(0)}) \tag{4.87} \]
is the first iteration in computing an MLE (or RLE) using the Newton-Raphson
iteration method with $\hat\theta_n^{(0)}$ as the initial value (see (4.53)) and,
therefore, is called the one-step MLE. Without any further iteration, $\hat\theta_n^{(1)}$
can be used as a numerical approximation to an MLE or RLE; and $\hat\theta_n^{(1)}$
is asymptotically efficient under some conditions, as the following result
shows.


Theorem 4.19. Assume that the conditions in Theorem 4.16 hold and
that $\hat\theta_n^{(0)}$ is $\sqrt{n}$-consistent for $\theta$ (Definition 2.10).

(i) The one-step MLE $\hat\theta_n^{(1)}$ is asymptotically efficient.

(ii) The one-step MLE obtained by replacing $\nabla s_n(\gamma)$ in (4.87) with its
expected value, $-I_n(\gamma)$ (the Fisher-scoring method), is asymptotically efficient.

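Proof. Since $\hat\theta_n^{(0)}$ is $\sqrt{n}$-consistent, we can focus on the event $\hat\theta_n^{(0)} \in A_\epsilon =
\{\gamma : \|\gamma - \theta\| < \epsilon\}$ for a sufficiently small $\epsilon$ such that $A_\epsilon \subset \Theta$. From the
mean-value theorem,
\[ s_n(\hat\theta_n^{(0)}) = s_n(\theta) + \left[ \int_0^1 \nabla s_n\big(\theta + t(\hat\theta_n^{(0)} - \theta)\big)\, dt \right] (\hat\theta_n^{(0)} - \theta). \]
Substituting this into (4.87) we obtain that
\[ \hat\theta_n^{(1)} - \theta = -[\nabla s_n(\hat\theta_n^{(0)})]^{-1} s_n(\theta) + [I_k - G_n(\hat\theta_n^{(0)})] (\hat\theta_n^{(0)} - \theta), \]
where
\[ G_n(\hat\theta_n^{(0)}) = [\nabla s_n(\hat\theta_n^{(0)})]^{-1} \int_0^1 \nabla s_n\big(\theta + t(\hat\theta_n^{(0)} - \theta)\big)\, dt. \]
From (4.77), $\| [I_n(\theta)]^{1/2} [\nabla s_n(\hat\theta_n^{(0)})]^{-1} [I_n(\theta)]^{1/2} + I_k \| \to_p 0$. Using an argument
similar to those in the proofs of (4.77) and (4.82), we can show that
$\| G_n(\hat\theta_n^{(0)}) - I_k \| \to_p 0$. These results and the fact that $\sqrt{n}(\hat\theta_n^{(0)} - \theta) = O_p(1)$
imply
\[ \sqrt{n}(\hat\theta_n^{(1)} - \theta) = \sqrt{n}\, [I_n(\theta)]^{-1} s_n(\theta) + o_p(1). \]
This proves (i). The proof for (ii) is similar. ∎
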
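Example 4.40. Let $X_1, ..., X_n$ be i.i.d. from the Weibull distribution
$W(\theta, 1)$, where $\theta > 0$ is unknown. Note that
\[ s_n(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \log X_i - \sum_{i=1}^n X_i^\theta \log X_i \]
and
\[ \nabla s_n(\theta) = -\frac{n}{\theta^2} - \sum_{i=1}^n X_i^\theta (\log X_i)^2. \]
Hence, the one-step MLE of $\theta$ is
\[ \hat\theta_n^{(1)} = \hat\theta_n^{(0)} \left[ 1 + \frac{n + \hat\theta_n^{(0)} \sum_{i=1}^n \big( \log X_i - X_i^{\hat\theta_n^{(0)}} \log X_i \big)}{n + (\hat\theta_n^{(0)})^2 \sum_{i=1}^n X_i^{\hat\theta_n^{(0)}} (\log X_i)^2} \right]. \]
Usually one can use a moment estimator (§3.5.2) as the initial estimator
$\hat\theta_n^{(0)}$. In this example, a moment estimator of $\theta$ is the solution of
$\bar X = \Gamma(\theta^{-1} + 1)$. ∎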
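
The one-step update in Example 4.40 can be checked numerically. The following is a minimal sketch, not part of the text: it assumes the score $n/\theta + \sum \log X_i - \sum X_i^\theta \log X_i$ and its derivative as stated in the example, simulates data with Python's `random.weibullvariate` (whose arguments are scale first, shape second), and uses an arbitrary rough starting value in place of the moment estimator. The names `score`, `score_derivative`, `one_step_mle`, and `newton_mle`, the seed, the sample size, and the parameter values are all illustrative choices.

```python
import math
import random


def score(theta, xs):
    """Score s_n(theta) for i.i.d. data from the Weibull W(theta, 1)."""
    return (len(xs) / theta
            + sum(math.log(x) for x in xs)
            - sum(x ** theta * math.log(x) for x in xs))


def score_derivative(theta, xs):
    """Derivative of the score; it is negative for every theta > 0."""
    return (-len(xs) / theta ** 2
            - sum(x ** theta * math.log(x) ** 2 for x in xs))


def one_step_mle(theta0, xs):
    """One Newton-Raphson update as in (4.87), starting from theta0."""
    return theta0 - score(theta0, xs) / score_derivative(theta0, xs)


def newton_mle(theta0, xs, tol=1e-12, max_iter=100):
    """Repeat the update until the step is below tol (for comparison only)."""
    theta = theta0
    for _ in range(max_iter):
        new = one_step_mle(theta, xs)
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta


rng = random.Random(0)   # fixed seed, an arbitrary choice
true_theta = 1.5         # hypothetical true shape parameter
# weibullvariate(scale, shape): scale 1 gives the W(theta, 1) model above
xs = [rng.weibullvariate(1.0, true_theta) for _ in range(2000)]

theta0 = 1.3                            # hypothetical rough initial estimate
theta1 = one_step_mle(theta0, xs)       # one-step MLE
theta_hat = newton_mle(theta1, xs)      # fully iterated root of the score
print(theta0, theta1, theta_hat)
```

In this sketch the single update already moves the rough starting value most of the way to the fully iterated root, which is the behavior Theorem 4.19 describes for a $\sqrt{n}$-consistent initial estimator.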


Results similar to that in Theorem 4.19 can be obtained in non-i.i.d. 
cases, for example, the GLM discussed in §4.4.2 (exercise); see also §5.4. 




As we discussed in §4.1.3, Bayes estimators are usually consistent. The
next result, due to Bickel and Yahav (1969) and Ibragimov and Has'minskii
(1981), states that Bayes estimators are asymptotically efficient when $X_i$'s
are i.i.d.


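Theorem 4.20. Assume the conditions of Theorem 4.16. Let $\pi(\gamma)$ be a
prior p.d.f. (which may be improper) w.r.t. the Lebesgue measure on $\Theta$ and
$p_n(\gamma)$ be the posterior p.d.f., given $X_1, ..., X_n$, $n = 1, 2, ....$ Assume that
there exists an $n_0$ such that $p_{n_0}(\gamma)$ is continuous and positive for all $\gamma \in \Theta$,
$\int p_{n_0}(\gamma)\, d\gamma = 1$ and $\int \|\gamma\| p_{n_0}(\gamma)\, d\gamma < \infty$. Suppose further that, for any
$\epsilon > 0$, there exists a $\delta > 0$ such that
\[ \lim_{n \to \infty} P\left( \sup_{\|\gamma - \theta\| \ge \epsilon} \frac{\log \ell(\gamma) - \log \ell(\theta)}{n} \le -\delta \right) = 1 \tag{4.88} \]
and
\[ \lim_{n \to \infty} P\left( \sup_{\|\gamma - \theta\| \le \delta} \frac{\|\nabla s_n(\gamma) - \nabla s_n(\theta)\|}{n} \ge \epsilon \right) = 0, \tag{4.89} \]
where $\ell(\gamma)$ is the likelihood function and $s_n(\gamma)$ is the score function.

(i) Let $p_n^*(\gamma)$ be the posterior p.d.f. of $\sqrt{n}(\gamma - T_n)$, where $T_n = \theta +
[I_n(\theta)]^{-1} s_n(\theta)$ and $\theta$ is the true parameter value, and let $\psi(\gamma)$ be the p.d.f.
of $N_k(0, [I_1(\theta)]^{-1})$. Then
\[ \int (1 + \|\gamma\|)\, |p_n^*(\gamma) - \psi(\gamma)|\, d\gamma \to_p 0. \tag{4.90} \]

(ii) The Bayes estimator of $\theta$ under the squared error loss is asymptotically
efficient. ∎

The proof of Theorem 4.20 is lengthy and is omitted; see Lehmann
(1983, §6.7) for a proof of the case of univariate $\theta$.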


A number of conclusions can be drawn from Theorem 4.20. First, result
(4.90) shows that the posterior p.d.f. is approximately normal with mean
$\theta + [I_n(\theta)]^{-1} s_n(\theta)$ and covariance matrix $[I_n(\theta)]^{-1}$. This result is useful
in Bayesian computation; see Berger (1985, §4.9.3). Second, (4.90) shows
that the posterior distribution and its first-order moments converge to the
degenerate distribution at $\theta$ and its first-order moments, which implies the
consistency and asymptotic unbiasedness of Bayes estimators such as the
posterior means. Third, the Bayes estimator under the squared error loss is
asymptotically efficient, which provides additional support for the early
suggestion that the Bayesian approach is a useful method for generating
estimators. Finally, the results hold regardless of the prior being used,
indicating that the effect of the prior declines as n increases.
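
The normal approximation in (4.90) can be illustrated in a case where the posterior is available in closed form. The sketch below is an illustration under stated assumptions, not the setting of the theorem's proof: it takes i.i.d. binary observations with success probability $p$, a uniform prior on $(0, 1)$ (so the posterior is the beta distribution $B(s+1, n-s+1)$ for $s$ successes), and a fixed success fraction instead of random data. It approximates the weighted integral $\int (1 + |t|) |p_n^*(t) - \psi(t)|\, dt$ by a Riemann sum. The function names, the grid size, and the integration width are hypothetical choices.

```python
import math


def log_beta_pdf(p, a, b):
    """Log density of the beta distribution B(a, b) at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)


def weighted_l1(n, successes, p_true, grid=20001, width=8.0):
    """Riemann-sum approximation of the integral in (4.90) for binary data."""
    info1 = 1.0 / (p_true * (1.0 - p_true))      # Fisher information I_1(p)
    score_n = (successes - n * p_true) * info1   # score s_n(p)
    t_n = p_true + score_n / (n * info1)         # T_n = p + [I_n(p)]^{-1} s_n(p)
    sd = math.sqrt(1.0 / info1)                  # std. dev. of N(0, [I_1(p)]^{-1})
    a, b = successes + 1, n - successes + 1      # posterior under the uniform prior
    h = 2.0 * width * sd / (grid - 1)
    total = 0.0
    for i in range(grid):
        t = -width * sd + i * h
        p = t_n + t / math.sqrt(n)
        # posterior density of sqrt(n) * (p - T_n); zero outside (0, 1)
        post = (math.exp(log_beta_pdf(p, a, b)) / math.sqrt(n)
                if 0.0 < p < 1.0 else 0.0)
        normal = math.exp(-0.5 * (t / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += (1.0 + abs(t)) * abs(post - normal) * h
    return total


for n, s in ((50, 15), (500, 150), (5000, 1500)):
    print(n, weighted_l1(n, s, 0.3))
```

With the success fraction held at 0.3, the printed distances shrink as n grows, in line with the first conclusion drawn above.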




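In addition to the regularity conditions in Theorem 4.16, Theorem 4.20
requires two more nontrivial regularity conditions, (4.88) and (4.89). Let us
verify these conditions for natural exponential families (Example 4.39), i.e.,
$X_i$'s are i.i.d. with p.d.f. (4.79). Since $\nabla s_n(\eta) = -n\, \partial^2 \zeta(\eta) / \partial\eta\, \partial\eta^\tau$, (4.89)
follows from the continuity of the second-order derivatives of $\zeta$. To show
(4.88), consider first the case of univariate $\eta$. Without loss of generality,
we assume that $\gamma > \eta$. Note that
\[ \frac{\log \ell(\gamma) - \log \ell(\eta)}{n} = \left[ \bar T - \zeta'(\eta) + \zeta'(\eta) - \frac{\zeta(\gamma) - \zeta(\eta)}{\gamma - \eta} \right] (\gamma - \eta), \tag{4.91} \]
where $\bar T$ is the average of $T(X_i)$'s. Since $\zeta(\gamma)$ is strictly convex, $\gamma > \eta$
implies $\zeta'(\eta) < [\zeta(\gamma) - \zeta(\eta)] / (\gamma - \eta)$. Also, $\bar T \to_{a.s.} \zeta'(\eta)$. Hence, with
probability tending to 1, the factor in front of $(\gamma - \eta)$ on the right-hand
side of (4.91) is negative. Then (4.88) holds with
\[ \delta = \frac{\epsilon}{2} \inf_{\gamma \ge \eta + \epsilon} \left[ \frac{\zeta(\gamma) - \zeta(\eta)}{\gamma - \eta} - \zeta'(\eta) \right]. \]
To show how to extend this to multivariate $\eta$, consider the case of bivariate
$\eta$. Let $\eta_j$, $\gamma_j$, and $\xi_j$ be the jth components of $\eta$, $\gamma$, and $\bar T - \nabla\zeta(\eta)$,
respectively. Assume $\gamma_1 > \eta_1$ and $\gamma_2 > \eta_2$. Let $\zeta_j'$ be the derivative of $\zeta$
w.r.t. the jth component of $\eta$. Then the left-hand side of (4.91) is the sum
of
\[ (\gamma_1 - \eta_1)\xi_1 - [\zeta(\eta_1, \gamma_2) - \zeta(\eta_1, \eta_2) - (\gamma_2 - \eta_2)\zeta_2'(\eta_1, \eta_2)] \]
and
\[ (\gamma_2 - \eta_2)\xi_2 - [\zeta(\gamma_1, \gamma_2) - \zeta(\eta_1, \gamma_2) - (\gamma_1 - \eta_1)\zeta_1'(\eta_1, \eta_2)], \]
where the last quantity is bounded by
\[ (\gamma_2 - \eta_2)\xi_2 - [\zeta(\gamma_1, \gamma_2) - \zeta(\eta_1, \gamma_2) - (\gamma_1 - \eta_1)\zeta_1'(\eta_1, \gamma_2)], \]
since $\zeta_1'(\eta_1, \eta_2) \le \zeta_1'(\eta_1, \gamma_2)$. The rest of the proof is the same as the case
of univariate $\eta$.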


When Bayes estimators have explicit forms under a specific prior, it
is usually easy to prove the asymptotic efficiency of the Bayes estimators
directly. For instance, in Example 4.7, the Bayes estimator of $\theta$ is
\[ \frac{n\bar X + \gamma^{-1}}{n + \alpha - 1} = \bar X + \frac{\gamma^{-1}}{n + \alpha - 1} - \frac{\alpha - 1}{n + \alpha - 1}\, \bar X = \bar X + O\Big(\frac{1}{n}\Big) \quad \text{a.s.}, \]


where $\bar X$ is the MLE of $\theta$. Hence the Bayes estimator is asymptotically
efficient by Slutsky's theorem. A similar result can be obtained for the
Bayes estimator $\delta_t(X)$ in Example 4.7. Theorem 4.20, however, is useful in
cases where Bayes estimators do not have explicit forms and/or the prior
is not specified clearly. One such example is the problem in Example 4.40
(Exercises 153 and 154).
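
The direct argument for an explicit Bayes estimator amounts to simple arithmetic. As a small sketch, taking the Bayes estimator of Example 4.7 to be $(n\bar X + \gamma^{-1})/(n + \alpha - 1)$ and holding the sample mean fixed for illustration (in practice it varies with n), the code below shows that $\sqrt{n}$ times the gap between the Bayes estimator and the MLE $\bar X$ shrinks as n grows. The function names and the numerical values of the sample mean, $\alpha$, and $\gamma$ are hypothetical.

```python
def bayes_estimator(xbar, n, alpha, gamma):
    """Explicit Bayes estimator (n * xbar + 1 / gamma) / (n + alpha - 1)."""
    return (n * xbar + 1.0 / gamma) / (n + alpha - 1.0)


def scaled_gap(xbar, n, alpha, gamma):
    """sqrt(n) * (Bayes estimator - MLE); it is of order n ** -0.5."""
    return (n ** 0.5) * (bayes_estimator(xbar, n, alpha, gamma) - xbar)


# hypothetical values: sample mean 2.0, alpha = 3, gamma = 0.5
for n in (10, 100, 1000, 10000):
    print(n, scaled_gap(2.0, n, alpha=3.0, gamma=0.5))
```

Because the scaled gap tends to 0, the Bayes estimator and the MLE share the same limiting distribution after centering and scaling by $\sqrt{n}$, which is the step where Slutsky's theorem is applied.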




4.6 Exercises 


1. Show that the priors in the following cases are conjugate priors:
(a) $X_1, ..., X_n$ are i.i.d. from $N_k(\theta, I_k)$, $\theta \in R^k$, and $\Pi = N_k(\mu_0, \Sigma_0)$
(Normal family);
(b) $X_1, ..., X_n$ are i.i.d. from the binomial distribution $Bi(\theta, k)$, $\theta \in
(0, 1)$, and $\Pi = B(\alpha, \beta)$ (Beta family);
(c) $X_1, ..., X_n$ are i.i.d. from the uniform distribution $U(0, \theta)$, $\theta > 0$,
and $\Pi = Pa(a, b)$ (Pareto family);
(d) $X_1, ..., X_n$ are i.i.d. from the exponential distribution $E(0, \theta)$, $\theta >
0$, $\Pi$ = the inverse gamma distribution $\Gamma^{-1}(\alpha, \gamma)$ (a random variable
Y has the inverse gamma distribution $\Gamma^{-1}(\alpha, \gamma)$ if and only if $Y^{-1}$
has the gamma distribution $\Gamma(\alpha, \gamma)$).
(e) $X_1, ..., X_n$ are i.i.d. from the exponential distribution $E(\theta, 1)$, $\theta \in
R$, and $\Pi$ has a Lebesgue p.d.f. $b^{-1} e^{-a/b} e^{\theta/b} I_{(-\infty, a)}(\theta)$, $a \in R$, $b > 0$.

2. In Exercise 1, find the posterior mean and variance for each case.

3. Let $X_1, ..., X_n$ be i.i.d. from the $N(\theta, 1)$ distribution and let the prior
be the double exponential distribution $DE(0, 1)$. Obtain the posterior
mean.

4. Let $X_1, ..., X_n$ be i.i.d. from the uniform distribution $U(0, \theta)$, where
$\theta > 0$ is unknown. Let the prior of $\theta$ be the log-normal distribution
$LN(\mu_0, \sigma_0^2)$, where $\mu_0 \in R$ and $\sigma_0 > 0$ are known constants.
(a) Find the posterior p.d.f. of $\vartheta = \log \theta$.
(b) Find the rth posterior moment of $\theta$.
(c) Find a value that maximizes the posterior p.d.f. of $\theta$.

5. Show that if T(X) is a sufficient statistic for $\theta \in \Theta$, then the Bayes
action $\delta(x)$ in (4.3) is a function of T(x).

6. Let $\bar X$ be the sample mean of n i.i.d. observations from $N(\theta, \sigma^2)$ with
a known $\sigma > 0$ and an unknown $\theta \in R$. Let $\pi(\theta)$ be a prior p.d.f.
w.r.t. a $\sigma$-finite measure on R.
(a) Show that the posterior mean of $\theta$, given $\bar X = x$, is of the form
\[ \delta(x) = x + \frac{\sigma^2}{n} \frac{d \log(p(x))}{dx}, \]
where p(x) is the marginal p.d.f. of $\bar X$, unconditional on $\theta$.
(b) Express the posterior variance of $\theta$ (given $\bar X = x$) as a function
of the first two derivatives of $\log(p(x))$ w.r.t. x.
(c) Find explicit expressions for p(x) and $\delta(x)$ in (a) when the prior
is $N(\mu_0, \sigma_0^2)$ with probability $1 - \epsilon$ and a point mass at $\mu_1$ with
probability $\epsilon$, where $\mu_0$, $\mu_1$, and $\sigma_0^2$ are known constants.


7. Let $X_1, ..., X_n$ be i.i.d. binary random variables with $P(X_i = 1) =
p \in (0, 1)$. Find the Bayes action w.r.t. the uniform prior on [0, 1] in
the problem of estimating p under the loss $L(p, a) = (p - a)^2 / [p(1 - p)]$.

8. Consider the estimation of $\theta$ in Exercise 41 of §2.6 under the squared
error loss. Suppose that the prior of $\theta$ is the uniform distribution
$U(0, 1)$, the prior of j is $P(j = 1) = P(j = 2) = \frac{1}{2}$, and the joint prior
of $(\theta, j)$ is the product probability of the two marginal priors. Show
that the Bayes action is
\[ \delta(x) = \frac{H(x) B(t+1) + G(t+1)}{H(x) B(t) + G(t)}, \]
where $x = (x_1, ..., x_n)$ is the vector of observations, $t = x_1 + \cdots + x_n$,
$B(t) = \int_0^1 \theta^t (1 - \theta)^{n-t} d\theta$, $G(t) = \int_0^1 \theta^t e^{-n\theta} d\theta$, and H(x) is a function
of x with range {0, 1}.

9. Consider the estimation problem in Example 4.1 with the loss function
$L(\theta, a) = w(\theta)[g(\theta) - a]^2$, where $w(\theta) \ge 0$ and $\int_\Theta w(\theta)[g(\theta)]^2 d\Pi < \infty$.
Show that the Bayes action is
\[ \delta(x) = \frac{\int_\Theta w(\theta) g(\theta) f_\theta(x)\, d\Pi}{\int_\Theta w(\theta) f_\theta(x)\, d\Pi}. \]

10. Let X be a sample from $P_\theta$, $\theta \in \Theta \subset R$. Consider the estimation of $\theta$
under the loss $L(|\theta - a|)$, where L is an increasing function on $[0, \infty)$.
Let $\pi(\theta|x)$ be the posterior p.d.f. of $\theta$ given X = x. Suppose that
$\pi(\theta|x)$ is symmetric about $\delta(x) \in \Theta$ and that $\pi(\theta|x)$ is nondecreasing
for $\theta \le \delta(x)$ and nonincreasing for $\theta \ge \delta(x)$. Show that $\delta(x)$ is a
Bayes action, assuming that all integrals involved are finite.

11. Let X be a sample of size 1 from the geometric distribution G(p) with
an unknown $p \in (0, 1]$. Consider the estimation of p with A = [0, 1]
and the loss function $L(p, a) = (p - a)^2 / p$.
(a) Show that $\delta$ is a Bayes action w.r.t. $\Pi$ if and only if $\delta(x) =
1 - \int (1 - p)^x d\Pi(p) / \int (1 - p)^{x-1} d\Pi(p)$, $x = 1, 2, ....$
(b) Let $\delta_0$ be a rule such that $\delta_0(1) = 1/2$ and $\delta_0(x) = 0$ for all x > 1.
Show that $\delta_0$ is a limit of Bayes actions.
(c) Let $\delta_0$ be a rule such that $\delta_0(x) = 0$ for all x > 1 and $\delta_0(1)$ is
arbitrary. Show that $\delta_0$ is a generalized Bayes action.

12. Let X be a single observation from $N(\mu, \sigma^2)$ with a known $\sigma^2$ and
an unknown $\mu > 0$. Consider the estimation of $\mu$ under the squared
error loss and the noninformative prior $\Pi$ = the Lebesgue measure
on $(0, \infty)$. Show that the generalized Bayes action when X = x is
$\delta(x) = x + \sigma \Phi'(x/\sigma) / [1 - \Phi(-x/\sigma)]$, where $\Phi$ is the c.d.f. of the
standard normal distribution and $\Phi'$ is its derivative.


13. Let X be a sample from $P_\theta$ having the p.d.f. $h(x) \exp\{\theta^\tau x - \zeta(\theta)\}$
w.r.t. $\nu$. Let $\Pi$ be the Lebesgue measure on $\Theta = R^p$. Show that
the generalized Bayes action under the loss $L(\theta, a) = \|E(X) - a\|^2$ is
$\delta(x) = x$ when X = x.

14. Let $X_1, ..., X_n$ be i.i.d. random variables with the Lebesgue p.d.f.
$\sqrt{2/\pi}\, e^{-(x - \theta)^2/2} I_{(\theta, \infty)}(x)$, where $\theta \in R$ is unknown. Find the
generalized Bayes action for estimating $\theta$ under the squared error loss,
when the (improper) prior of $\theta$ is the Lebesgue measure on R.

15. Let $X_1, ..., X_n$ be i.i.d. from $N(\mu, \sigma^2)$ and $\pi(\mu, \sigma^2) = \sigma^{-2} I_{(0, \infty)}(\sigma^2)$
be an improper prior for $(\mu, \sigma^2)$ w.r.t. the Lebesgue measure on $R^2$.
(a) Show that the posterior p.d.f. of $(\mu, \sigma^2)$ given $x = (x_1, ..., x_n)$ is
$\pi(\mu, \sigma^2 | x) = \pi_1(\mu | \sigma^2, x)\, \pi_2(\sigma^2 | x)$, where $\pi_1(\mu | \sigma^2, x)$ is the p.d.f. of
$N(\bar x, \sigma^2/n)$ and $\pi_2(\sigma^2 | x)$ is the p.d.f. of the inverse gamma distribution
$\Gamma^{-1}((n - 1)/2, [\sum_{i=1}^n (x_i - \bar x)^2 / 2]^{-1})$ (see Exercise 1(d)).
(b) Show that the marginal posterior p.d.f. of $\mu$ given x is $f\big(\frac{\mu - \bar x}{\tau}\big)$,
where $\tau^2 = \sum_{i=1}^n (x_i - \bar x)^2 / [n(n - 1)]$ and f is the p.d.f. of the
t-distribution $t_{n-1}$.
(c) Obtain the generalized Bayes action for estimating $\mu/\sigma$ under the
squared error loss.

16. Consider Example 3.13. Under the squared error loss and the prior
with the improper Lebesgue density $\pi(\mu_1, ..., \mu_m, \sigma^2) = \sigma^{-2}$, obtain
the generalized Bayes action for estimating $\theta = \sigma^{-2} \sum_{i=1}^m n_i (\mu_i - \bar\mu)^2$,
where $\bar\mu = n^{-1} \sum_{i=1}^m n_i \mu_i$.

17. Let X be a single observation from the Lebesgue p.d.f. $e^{-x + \theta} I_{(\theta, \infty)}(x)$,
where $\theta > 0$ is an unknown parameter. Consider the estimation of
\[ \vartheta = \begin{cases} j & \theta \in (j - 1, j],\ j = 1, 2, 3, \\ 4 & \theta > 3 \end{cases} \]
under the loss L(i, j), $1 \le i, j \le 4$, given by the following matrix:
\[ \begin{pmatrix} 0 & 1 & 1 & 2 \\ 1 & 0 & 2 & 2 \\ 1 & 2 & 0 & 2 \\ 3 & 3 & 3 & 0 \end{pmatrix} \]
When X = 4, find the Bayes action w.r.t. the prior with the Lebesgue
p.d.f. $e^{-\theta} I_{(0, \infty)}(\theta)$.

18. (Bayesian hypothesis testing). Let X be a sample from $P_\theta$, where
$\theta \in \Theta$. Let $\Theta_0 \subset \Theta$ and $\Theta_1 = \Theta_0^c$, the complement of $\Theta_0$. Consider
the problem of testing $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$ under the loss
\[ L(\theta, a_i) = \begin{cases} 0 & \theta \in \Theta_i, \\ C_i & \theta \notin \Theta_i, \end{cases} \]


where $C_i > 0$ are known constants and $\{a_0, a_1\}$ is the action space.
Let $\Pi_{\theta|x}$ be the posterior distribution of $\theta$ w.r.t. a prior distribution
$\Pi$, given X = x. Show that the Bayes action $\delta(x) = a_1$ if and only if
$\Pi_{\theta|x}(\Theta_1) \ge C_1 / (C_0 + C_1)$.

19. In (b)-(d) of Exercise 1, assume that the parameters in priors are
unknown. Using the method of moments, find empirical Bayes actions
under the squared error loss.

20. In Example 4.5, assume that both $\mu_0$ and $\sigma_0^2$ in the prior for $\mu$ are
unknown. Let the second-stage joint prior for $(\mu_0, \sigma_0^2)$ be the product
of $N(a, v^2)$ and the Lebesgue measure on $(0, \infty)$, where a and v
are known. Under the squared error loss, obtain a formula for the
hierarchical Bayes action in terms of a one-dimensional integral.

21. Let $X_1, ..., X_n$ be i.i.d. random variables from the uniform distribution
$U(0, \theta)$, where $\theta > 0$ is unknown. Let $\pi(\theta) = b a^b \theta^{-(b+1)} I_{(a, \infty)}(\theta)$
be a prior p.d.f. w.r.t. the Lebesgue measure, where b > 1 is known
but a > 0 is an unknown hyperparameter. Consider the estimation
of $\theta$ under the squared error loss.
(a) Show that the empirical Bayes method using the method of
moments produces the empirical Bayes action $\delta(\hat a)$, where $\delta(a) =
\frac{b+n}{b+n-1} \max\{a, X_{(n)}\}$, $\hat a = \frac{2(b-1)}{bn} \sum_{i=1}^n X_i$, and $X_{(n)}$ is the largest
order statistic.
(b) Let $h(a) = a^{-1} I_{(0, \infty)}(a)$ be an improper Lebesgue prior density
for a. Obtain explicitly the hierarchical generalized Bayes action.

22. Let X be a sample and $\delta(X)$ with any fixed $X = x \in A$ be a Bayes
action, where $\delta$ is a measurable function and $\int_\Theta P_\theta(A)\, d\Pi = 1$. Show
that $\delta(X)$ is a Bayes rule as defined in §2.3.2.

23. Let $X_1, ..., X_n$ be i.i.d. random variables with the Lebesgue p.d.f.
$f_\theta(x) = \sqrt{2\theta/\pi}\, e^{-\theta x^2/2} I_{(0, \infty)}(x)$, where $\theta > 0$ is unknown. Let the
prior of $\theta$ be the gamma distribution $\Gamma(\alpha, \gamma)$ with known $\alpha$ and $\gamma$.
Find the Bayes estimator of $f_\theta(0)$ and its Bayes risk under the loss
function $L(\theta, a) = (a - \theta)^2 / \theta$.

24. Let X be a single observation from $N(0, \theta^2)$ and consider a prior p.d.f.
$\pi_\xi(\theta) = c(\alpha, \mu, \tau) |\theta|^{-\alpha} e^{-(\theta^{-1} - \mu)^2 / (2\tau^2)}$ w.r.t. the Lebesgue measure,
where $\xi = (\alpha, \mu, \tau)$ is a vector of hyperparameters and $c(\alpha, \mu, \tau)$ ensures
that $\pi_\xi(\theta)$ is a p.d.f.
(a) Identify the constraints on the hyperparameters for $\pi_\xi(\theta)$ to be a
proper prior.
(b) Show that the posterior p.d.f. is $\pi_{\xi_*}(\theta)$ for given X = x and identify
$\xi_*$.


(c) Express the Bayes estimator of $|\theta|$ and its Bayes risk in terms of
the function c and $\xi$, and state any additional constraints needed on
the hyperparameters.

25. Let $X_1, X_2, ...$ be i.i.d. from the exponential distribution E(0, 1). Suppose
that we observe $T = X_1 + \cdots + X_\theta$, where $\theta$ is an unknown
integer $\ge 1$. Consider the estimation of $\theta$ under the loss function
$L(\theta, a) = (\theta - a)^2 / \theta$ and the geometric distribution G(p) as the prior
for $\theta$, where $p \in (0, 1)$ is known.
(a) Show that the posterior expected loss is
\[ E[L(\theta, a) | T = t] = 1 + \xi - 2a + (1 - e^{-\xi}) a^2 / \xi, \]
where $\xi = (1 - p)t$.
(b) Find the Bayes estimator of $\theta$ and show that its posterior expected
loss is $1 - \xi \sum_{m=1}^\infty e^{-m\xi}$.
(c) Find the marginal distribution of $(1 - p)T$, unconditional on $\theta$.
(d) Obtain an explicit expression for the Bayes risk of the Bayes
estimator in part (b).

26. Prove (ii) and (iii) of Theorem 4.2.

27. Let $X_1, ..., X_n$ be i.i.d. binary random variables with $P(X_i = 1) =
p \in (0, 1)$.
(a) Show that $\bar X$ is an admissible estimator of p under the loss function
$(a - p)^2 / [p(1 - p)]$.
(b) Show that $\bar X$ is an admissible estimator of p under the squared
error loss.

28. Let X be a sample (of size 1) from $N(\mu, 1)$. Consider the estimation
of $\mu$ under the loss function $L(\mu, a) = |\mu - a|$. Show that X is an
admissible estimator.

29. In Exercise 1, consider the posterior mean to be the Bayes estimator
of the corresponding parameter in each case.
(a) Show that the bias of the Bayes estimator converges to 0 if $n \to \infty$.
(b) Show that the Bayes estimator is consistent.
(c) Discuss whether the Bayes estimator is admissible.

30. Let $X_1, ..., X_n$ be i.i.d. binary random variables with $P(X_i = 1) =
p \in (0, 1)$.
(a) Obtain the Bayes estimator of $p(1 - p)$ w.r.t. $\Pi$ = the beta distribution
$B(\alpha, \beta)$ with known $\alpha$ and $\beta$, under the squared error loss.
(b) Compare the Bayes estimator in part (a) with the UMVUE of
$p(1 - p)$.
(c) Discuss the bias, consistency, and admissibility of the Bayes estimator
in (a).


(d) Let $\pi(p) = [p(1 - p)]^{-1} I_{(0,1)}(p)$ be an improper Lebesgue prior
density for p. Show that the posterior of p given $X_i$'s is a p.d.f. provided
that the sample mean $\bar X \in (0, 1)$.
(e) Under the squared error loss, find the generalized Bayes estimator
of $p(1 - p)$ w.r.t. the improper prior in (d).

31. Let X be an observation from the negative binomial distribution
NB(p, r) with a known r and an unknown $p \in (0, 1)$.
(a) Under the squared error loss, find Bayes estimators of p and $p^{-1}$
w.r.t. $\Pi$ = the beta distribution $B(\alpha, \beta)$ with known $\alpha$ and $\beta$.
(b) Show that the Bayes estimators in (a) are consistent as $r \to \infty$.

32. In Example 4.7, show that
(a) $\bar X$ is the generalized Bayes estimator of $\theta$ w.r.t. the improper
prior $\frac{d\Pi}{d\omega} = I_{(0, \infty)}(\omega)$ and is a limit of Bayes estimators (as $\alpha \to 1$
and $\gamma \to \infty$);
(b) under the squared error loss for estimating $\theta$, the Bayes estimator
$(n\bar X + \gamma^{-1}) / (n + \alpha - 1)$ is admissible, but the limit of Bayes estimators,
$n\bar X / (n + \alpha - 1)$ with an $\alpha \ne 2$, is inadmissible.

33. Consider Example 4.8. Show that the sample mean $\bar X$ is a generalized
Bayes estimator of $\mu$ under the squared error loss and $\bar X$ is admissible
using (a) Theorem 4.3 and (b) the result in Example 4.6.

34. Let X be an observation from the gamma distribution $\Gamma(\alpha, \theta)$ with a
known $\alpha$ and an unknown $\theta > 0$. Show that $X / (\alpha + 1)$ is an admissible
estimator of $\theta$ under the squared error loss, using Theorem 4.3.

35. Let $X_1, ..., X_n$ be i.i.d. from the uniform distribution $U(\theta, \theta + 1)$, $\theta \in
R$. Consider the estimation of $\theta$ under the squared error loss.
(a) Let $\pi(\theta)$ be a continuous and positive Lebesgue p.d.f. on R. Derive
the Bayes estimator w.r.t. the prior $\pi$ and show that it is a consistent
estimator of $\theta$.
(b) Show that $(X_{(1)} + X_{(n)} - 1)/2$ is an admissible estimator of $\theta$ and
obtain its risk, where $X_{(j)}$ is the jth order statistic.

36. Consider the normal linear model $X = N_n(Z\beta, \sigma^2 I_n)$, where Z is an
$n \times p$ known matrix of full rank, p < n, $\beta \in R^p$, and $\sigma^2 > 0$.
(a) Assume that $\sigma^2$ is known. Derive the posterior distribution of $\beta$
when the prior distribution for $\beta$ is $N_p(\beta_0, \sigma^2 V)$, where $\beta_0 \in R^p$ is
known and V is a known positive definite matrix, and find the Bayes
estimator of $l^\tau \beta$ under the squared error loss, where $l \in R^p$ is known.
(b) Show that the Bayes estimator in (a) is admissible and consistent
as $n \to \infty$, assuming that the minimum eigenvalue of $Z^\tau Z \to \infty$.
(c) Repeat (a) and (b) when $\sigma^2$ is unknown and has the inverse gamma
distribution $\Gamma^{-1}(\alpha, \gamma)$ (see Exercise 1(d)), where $\alpha$ and $\gamma$ are known.


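(d) In part (c), obtain Bayes estimators of $\sigma^2$ and $l^\tau \beta / \sigma$ under the
squared error loss and show that they are consistent under the condition
in (b).

37. In Example 4.9, suppose that $\varepsilon_{ij}$ has the Lebesgue p.d.f.
\[ \kappa(\delta)\, \sigma_i^{-1} \exp\big\{ -c(\delta) |x / \sigma_i|^{2/(1+\delta)} \big\}, \]
where
\[ c(\delta) = \left[ \frac{\Gamma\big(\frac{3(1+\delta)}{2}\big)}{\Gamma\big(\frac{1+\delta}{2}\big)} \right]^{\frac{1}{1+\delta}}, \qquad \kappa(\delta) = \frac{\big[\Gamma\big(\frac{3(1+\delta)}{2}\big)\big]^{1/2}}{(1+\delta) \big[\Gamma\big(\frac{1+\delta}{2}\big)\big]^{3/2}}, \]
$-1 < \delta < 1$ and $\sigma_i > 0$.
(a) Assume that $\delta$ is known. Let $\omega_i = c(\delta)\, \sigma_i^{-2/(1+\delta)}$. Under the
squared error loss and the same prior in Example 4.9, show that the
Bayes estimator of $\sigma_i^2$ is
\[ q_i(\delta) \int \Big[ \gamma + \sum_{j=1}^{n_i} |x_{ij} - \beta^\tau Z_i|^{2/(1+\delta)} \Big]^{1+\delta} f(\beta | x, \delta)\, d\beta, \]
where $q_i(\delta) = [c(\delta)]^{1+\delta}\, \Gamma\big(\frac{1+\delta}{2} n_i + \alpha - \delta\big) / \Gamma\big(\frac{1+\delta}{2} n_i + \alpha + 1\big)$ and
\[ f(\beta | x, \delta) \propto \pi(\beta) \prod_{i=1}^k \Big[ \gamma + \sum_{j=1}^{n_i} |x_{ij} - \beta^\tau Z_i|^{2/(1+\delta)} \Big]^{-\left(\alpha + 1 + \frac{1+\delta}{2} n_i\right)}. \]
(b) Assume that $\delta$ has a prior p.d.f. $f(\delta)$ and that given $\delta$, $\omega_i$ still
has the same prior in (a). Derive a formula (similar to that in (a))
for the Bayes estimator of $\sigma_i^2$.

38. Suppose that we have observations
\[ X_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, ..., k,\ j = 1, ..., m, \]
where $\varepsilon_{ij}$'s are i.i.d. from $N(0, \sigma_\varepsilon^2)$, $\mu_i$'s are i.i.d. from $N(\mu, \sigma_\mu^2)$, and
$\varepsilon_{ij}$'s and $\mu_i$'s are independent. Suppose that the distribution for $\sigma_\varepsilon^2$
is the inverse gamma distribution $\Gamma^{-1}(\alpha_1, \beta_1)$ (see Exercise 1(d)); the
distribution for $\sigma_\mu^2$ is the inverse gamma distribution $\Gamma^{-1}(\alpha_2, \beta_2)$; the
distribution for $\mu$ is $N(\mu_0, \sigma_0^2)$; and $\sigma_\varepsilon$, $\sigma_\mu$, and $\mu$ are independent.
Describe a Gibbs sampler and obtain explicit forms of
(a) the distribution of $\mu$, given $X_{ij}$'s, $\mu_i$'s, $\sigma_\varepsilon^2$, and $\sigma_\mu^2$;
(b) the distribution of $\mu_i$, given $X_{ij}$'s, $\mu$, $\sigma_\varepsilon^2$, and $\sigma_\mu^2$;
(c) the distribution of $\sigma_\varepsilon^2$, given $X_{ij}$'s, $\mu_i$'s, $\mu$, and $\sigma_\mu^2$;
(d) the distribution of $\sigma_\mu^2$, given $X_{ij}$'s, $\mu_i$'s, $\mu$, and $\sigma_\varepsilon^2$.

39. Prove (4.16).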


40. Consider a Lebesgue p.d.f. $p(y) \propto (2 + y)^{125} (1 - y)^{38} y^{34} I_{(0,1)}(y)$. Generate
Markov chains of length 10,000 and compute approximations to
$\int y p(y)\, dy$, using the Metropolis kernel with q(y, z) being the p.d.f. of
$N(y, r^2)$, given y, where (a) r = 0.001; (b) r = 0.05; (c) r = 0.12.

41. Prove Proposition 4.4 for the cases of variance and risk.

42. In the proof of Theorem 4.5, show that if L is (strictly) convex and
not monotone, then $E[L(T_0(x) - a) | D = d]$ is (strictly) convex and
not monotone in a.

43. Prove part (iii) of Theorem 4.5.

44. Under the conditions of Theorem 4.5 and the loss function $L(\mu, a) =
|\mu - a|$, show that $u_*(d)$ in Theorem 4.5 is any median (Exercise 92 in
§2.6) of $T_0(X)$ under the conditional distribution of X given D = d
when $\mu = 0$.

45. Show that if there is a location invariant estimator $T_0$ of $\mu$ with finite
mean, then $E_0[T(X) | D = d]$ is finite a.s. P for any location invariant
estimator T.

46. Show (4.21) under the squared error loss.

47. In Exercise 14, find the MRIE of $\theta$ under the squared error loss.

48. In Example 4.12,
(a) show that $X_{(1)} - \theta \log 2 / n$ is an MRIE of $\mu$ under the absolute
error loss $L(\mu - a) = |\mu - a|$;
(b) show that $X_{(1)} - t$ is an MRIE under the loss function $L(\mu - a) =
I_{(t, \infty)}(|\mu - a|)$.

49. In Example 4.13, show that $T_*$ is also an MRIE of $\mu$ if the loss function
is convex and even. (Hint: the distribution of $T_*(X)$ given D depends
only on $X_{(n)} - X_{(1)}$ and is symmetric about 0 when $\mu = 0$.)

50. Let $X_1, ..., X_n$ be i.i.d. from the double exponential distribution
$DE(\mu, 1)$ with an unknown $\mu \in R$. Under the squared error loss,
find the MRIE of $\mu$. (Hint: for $x_1 < \cdots < x_n$ and $x_k < t < x_{k+1}$,
$\sum_{i=1}^n |x_i - t| = \sum_{i=k+1}^n x_i - \sum_{i=1}^k x_i + (2k - n)t$.)

51. In Example 4.11, find the MRIE of $\mu$ under the loss function
\[ L(\mu - a) = \begin{cases} -\alpha(\mu - a) & \mu < a, \\ \beta(\mu - a) & \mu \ge a, \end{cases} \]
where $\alpha$ and $\beta$ are positive constants. (Hint: show that if Y is a
random variable with c.d.f. F, then $E[L(Y - u)]$ is minimized for any
u satisfying $F(u) = \beta / (\alpha + \beta)$.)


52. Let T be a location invariant estimator of $\mu$ in a one-parameter location
problem. Show that T is an MRIE under the squared error
loss if and only if T is unbiased and $E[T(X)U(X)] = 0$ for any U(X)
satisfying $U(x_1 + c, ..., x_n + c) = U(x)$ for any c, $E[U(X)] = 0$ for any
$\mu$, and $Var(U) < \infty$.

53. Assume the conditions in Theorem 4.6. Let T be a sufficient statistic
for $\mu$. Show that Pitman's estimator is a function of T.

54. Prove Proposition 4.5, Theorems 4.7 and 4.8, and Corollary 4.1.

55. Under the conditions of Theorem 4.8 and the loss function (4.24) with
p = 1, show that $u_*(z)$ is any constant c > 0 satisfying
\[ \int_0^c x\, dP_{x|z} = \int_c^\infty x\, dP_{x|z}, \]
where $P_{x|z}$ is the conditional distribution of X given Z = z when
$\sigma = 1$.

56. In Example 4.15, show that the MRIE is $2^{(n+1)^{-1}} X_{(n)}$ when the loss
is given by (4.24) with p = 1.

57. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(0, \theta)$ with
an unknown $\theta > 0$.
(a) Find the MRIE of $\theta$ under the loss (4.24) with p = 2.
(b) Find the MRIE of $\theta$ under the loss (4.24) with p = 1.
(c) Find the MRIE of $\theta^2$ under the loss (4.24) with p = 2.

58. Let $X_1, ..., X_n$ be i.i.d. with a Lebesgue p.d.f. $(2/\sigma)[1 - (x/\sigma)] I_{(0, \sigma)}(x)$,
where $\sigma > 0$ is an unknown scale parameter. Find Pitman's estimator
of $\sigma^h$ for n = 2, 3, and 4.

59. Let $X_1, ..., X_n$ be i.i.d. from the Pareto distribution $Pa(\sigma, \alpha)$, where
$\sigma > 0$ is an unknown parameter and $\alpha > 2$ is known. Find the MRIE
of $\sigma$ under the loss function (4.24) with p = 2.

60. Assume that the sample X has a joint Lebesgue p.d.f. given by (4.25).
Show that a loss function for the estimation of $\mu$ is invariant under
the location-scale transformations $g_{c,r}(X) = (rX_1 + c, ..., rX_n + c)$,
$r > 0$, $c \in R$, if and only if it is of the form $L\big(\frac{a - \mu}{\sigma}\big)$.

61. Prove Proposition 4.6, Theorem 4.9, and Corollary 4.2.

62. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(\mu, \sigma)$,
where $\mu \in R$ and $\sigma > 0$ are unknown.
(a) Find the MRIE of $\sigma$ under the loss (4.24) with p = 1 or 2.
(b) Under the loss function $(a - \mu)^2 / \sigma^2$, find the MRIE of $\mu$.
(c) Compute the bias of the MRIE of $\mu$ in (b).


63. Suppose that X and Y are two samples with p.d.f. given by (4.30).
(a) Suppose that $\mu_x = \mu_y = 0$ and consider the estimation of $\eta =
(\sigma_y / \sigma_x)^h$ with a fixed $h \ne 0$ under the loss $L(a/\eta)$. Show that the
problem is invariant under the transformations $g(X, Y) = (rX, r'Y)$,
$r > 0$, $r' > 0$. Generalize Proposition 4.5, Theorem 4.8, and Corollary
4.1 to the present problem.
(b) Generalize the result in (a) to the case of unknown $\mu_x$ and $\mu_y$
under the transformations in (4.31).

64. Under the conditions of part (a) of the previous exercise and the loss
function $(a - \eta)^2 / \eta^2$, determine the MRIE of $\eta$ in the following cases:
(a) m = n = 1, X and Y are independent, X has the gamma distribution
$\Gamma(\alpha_x, \gamma)$ with a known $\alpha_x$ and an unknown $\gamma = \sigma_x > 0$,
and Y has the gamma distribution $\Gamma(\alpha_y, \gamma)$ with a known $\alpha_y$ and an
unknown $\gamma = \sigma_y > 0$;
(b) X is $N_m(0, \sigma_x^2 I_m)$, Y is $N_n(0, \sigma_y^2 I_n)$, and X and Y are independent;
(c) X and Y are independent, the components of X are i.i.d. from
the uniform distribution $U(0, \sigma_x)$, and the components of Y are i.i.d.
from the uniform distribution $U(0, \sigma_y)$.

65. Let $X_1, ..., X_m$ and $Y_1, ..., Y_n$ be two independent samples, where $X_i$'s
are i.i.d. having the p.d.f. $\sigma_x^{-1} f\big(\frac{x - \mu_x}{\sigma_x}\big)$ with $\mu_x \in R$ and $\sigma_x > 0$, and
$Y_i$'s are i.i.d. having the p.d.f. $\sigma_y^{-1} f\big(\frac{x - \mu_y}{\sigma_y}\big)$ with $\mu_y \in R$ and $\sigma_y > 0$.
Under the loss function $(a - \eta)^2 / \eta^2$ and the transformations in (4.31),
obtain the MRIE of $\eta = \sigma_y / \sigma_x$ when
(a) f is the p.d.f. of N(0, 1);
(b) f is the p.d.f. of the exponential distribution E(0, 1);
(c) f is the p.d.f. of the uniform distribution $U\big(-\frac{1}{2}, \frac{1}{2}\big)$;
(d) In (a)-(c), find the MRIE of $\Delta = \mu_y - \mu_x$ under the assumption
that $\sigma_x = \sigma_y = \sigma$ and under the loss function $(a - \Delta)^2 / \sigma^2$.

66. Consider the general linear model (3.25) under the assumption that
$\varepsilon_i$'s are i.i.d. with the p.d.f. $\sigma^{-1} f(x/\sigma)$, where f is a known Lebesgue
p.d.f.
(a) Show that the family of populations is invariant under the transformations
in (4.32).
(b) Show that the estimation of $l^\tau \beta$ with $l \in R(Z)$ is invariant under
the loss function $L\big(\frac{a - l^\tau \beta}{\sigma}\big)$.
(c) Show that the LSE $l^\tau \hat\beta$ is an invariant estimator of $l^\tau \beta$, $l \in R(Z)$.
(d) Prove Theorem 4.10.

67. In Example 4.18, let T be a randomized estimator of p with probability
$n/(n + 1)$ being $\bar X$ and probability $1/(n + 1)$ being $\frac{1}{2}$. Show that


T has a constant risk that is smaller than the maximum risk of $\bar X$.

68. Let X be a single sample from the geometric distribution G(p) with
an unknown $p \in (0, 1)$. Show that $I_{\{1\}}(X)$ is a minimax estimator of
p under the loss function $(a - p)^2 / [p(1 - p)]$.

69. In Example 4.19, show that $\bar X$ is a minimax estimator of $\mu$ under the
loss function $(a - \mu)^2 / \sigma^2$ when $\Theta = R \times (0, \infty)$.

70. Let T be a minimax (or admissible) estimator of $\vartheta$ under the squared
error loss. Show that $c_1 T + c_0$ is a minimax (or admissible) estimator
of $c_1 \vartheta + c_0$ under the squared error loss, where $c_1$ and $c_0$ are constants.

71. Let X be a sample from $P_\theta$ with an unknown $\theta = (\theta_1, \theta_2)$, where $\theta_j \in
\Theta_j$, j = 1, 2, and let $\Pi_2$ be a probability measure on $\Theta_2$. Suppose that
an estimator $T_0$ minimizes $\sup_{\theta_1 \in \Theta_1} \int R_T(\theta)\, d\Pi_2(\theta_2)$ over all estimators
T and that $\sup_{\theta_1 \in \Theta_1} \int R_{T_0}(\theta)\, d\Pi_2(\theta_2) = \sup_{\theta_1 \in \Theta_1, \theta_2 \in \Theta_2} R_{T_0}(\theta)$.
Show that $T_0$ is a minimax estimator.

72. Let $X_1, ..., X_m$ be i.i.d. from $N(\mu_x, \sigma_x^2)$ and $Y_1, ..., Y_n$ be i.i.d. from
$N(\mu_y, \sigma_y^2)$. Assume that $X_i$'s and $Y_j$'s are independent. Consider the
estimation of $\Delta = \mu_y - \mu_x$ under the squared error loss.
(a) Show that $\bar Y - \bar X$ is a minimax estimator of $\Delta$ when $\sigma_x$ and $\sigma_y$
are known, where $\bar X$ and $\bar Y$ are the sample means based on $X_i$'s and
$Y_j$'s, respectively.
(b) Show that $\bar Y - \bar X$ is a minimax estimator of $\Delta$ when $\sigma_x \in (0, c_x]$
and $\sigma_y \in (0, c_y]$, where $c_x$ and $c_y$ are constants.

73. Consider the general linear model (3.25) with assumption A1 and the
estimation of $l^\tau \beta$ under the squared error loss, where $l \in R(Z)$. Show
that the LSE $l^\tau \hat\beta$ is minimax if $\sigma^2 \in (0, c]$ with a constant c.

74. Let X be a random variable having the hypergeometric distribution
$HG(r, \theta, N - \theta)$ (Table 1.1, page 18) with known N and r but an
unknown $\theta$. Consider the estimation of $\theta/N$ under the squared error
loss.
(a) Show that the risk function of $T(X) = \alpha X / r + \beta$ is constant,
where $\alpha = \{1 + \sqrt{(N - r) / [r(N - 1)]}\}^{-1}$ and $\beta = (1 - \alpha)/2$.
(b) Show that T in (a) is the minimax estimator of $\theta/N$ and the Bayes
estimator w.r.t. the prior
\[ \Pi(\{\theta\}) = \frac{\Gamma(2c)}{[\Gamma(c)]^2} \int_0^1 \binom{N}{\theta} t^{\theta + c - 1} (1 - t)^{N - \theta + c - 1}\, dt, \quad \theta = 1, ..., N, \]
where $c = \beta / (\alpha/r - 1/N)$.


310 


795. 


76. 


77. 


78. 


79. 


80. 


81. 


82. 


83. 


4, Estimation in Parametric Models 


Let X be a single observation from N(,1) and let w have the im- 
proper Lebesgue prior density m(u) = e“. Under the squared error 
loss, show that the generalized Bayes estimator of y is X + 1, which 
is neither minimax nor admissible. 


Let X be a random variable having the Poisson distribution P(@) with 
an unknown @ > 0. Consider the estimation of 9 under the squared 
error loss. 

(a) Show that supg Rr(@) = oo for any estimator T = T(X). 

(b) Let S$ = {aX +6: a€R,b€ R}. Show that 0 is a S-admissible 
estimator of 6. 


Let X1,..., Xp be i.i.d. from the exponential distribution E(a, 0) with 
a known @ and an unknown a € R. Under the squared error loss, 
show that X(1) — 6/n is the unique minimax estimator of a. 


Let X1,...,X, be iid. from the uniform distribution U(y— 4, + 4) 
with an unknown pz € R. Under the squared error loss, show that 
(Xi) + X(ny)/2 is the unique minimax estimator of p. 


Let Xy,...,Xn be ii.d. from the double exponential distribution 
DE(u,1) with an unknown p € R. Under the squared error loss, 
find a minimax estimator of ju. 


Consider Example 4.7. Show that (nX + b)/(n +1) is an admissi- 
ble estimator of 6 under the squared error loss for any b > 0 and 


that nX/(n +1) is a minimax estimator of @ under the loss function 
L(6,a) = (a— 6)?/0?. 


81. Let $X_1,...,X_n$ be i.i.d. binary random variables with $P(X_i = 1) = p \in (0,1)$.
Consider the estimation of $p$ under the squared error loss.
Using Theorem 4.14, show that $\bar{X}$ and $(\bar{X} + \gamma\lambda)/(1 + \lambda)$ with $\lambda > 0$
and $0 < \gamma < 1$ are admissible.


82. Let $X$ be a single observation. Using Theorem 4.14, find values of $\alpha$
and $\beta$ such that $\alpha X + \beta$ are admissible for estimating $EX$ under the
squared error loss when

(a) $X$ has the Poisson distribution $P(\theta)$ with an unknown $\theta > 0$;

(b) $X$ has the negative binomial distribution $NB(p,r)$ with a known
$r$ and an unknown $p \in (0,1)$.


83. Let $X$ be a single observation having the Lebesgue p.d.f. $\frac{1}{2}c(\theta)e^{\theta x - |x|}$,
$|\theta| < 1$.

(a) Show that $c(\theta) = 1 - \theta^2$.

(b) Show that if $0 \le \alpha \le \frac{1}{2}$, then $\alpha X + \beta$ is admissible for estimating
$E(X)$ under the squared error loss.




84. Let $X$ be a single observation from the discrete p.d.f.
$f_\theta(x) = [x!(1 - e^{-\theta})]^{-1}\theta^x e^{-\theta} I_{\{1,2,...\}}(x)$, where $\theta > 0$ is unknown.
Consider the estimation of $\vartheta = \theta/(1 - e^{-\theta})$ under the squared error loss.

(a) Show that the estimator $X$ is admissible.

(b) Show that $X$ is not minimax unless $\sup_\theta R_T(\theta) = \infty$ for any
estimator $T = T(X)$.

(c) Find a loss function under which $X$ is minimax and admissible.


85. In Example 4.23, find the UMVUE of $\theta = (p_1,...,p_k)$ under the loss
function (4.37).


86. Let $X$ be a sample from $P_\theta$, $\theta \in \Theta \subset \mathcal{R}^p$. Consider the estimation of
$\theta$ under the loss $(\theta - a)^\tau Q(\theta - a)$, where $a \in A = \Theta$ and $Q$ is a known
positive definite matrix. Show that the Bayes action is the posterior
mean $E(\theta|X = x)$, assuming that all integrals involved are finite.


87. In Example 4.24, show that $X$ is the MRIE of $\theta$ under the loss function
(4.37), if

(a) $f(x - \theta) = \prod_{j=1}^p f_j(x_j - \theta_j)$, where each $f_j$ is a known Lebesgue
p.d.f. with mean 0;

(b) $f(x - \theta) = f(\|x - \theta\|)$ with $\int x f(\|x\|)\,dx = 0$.


88. Prove that $X$ in Example 4.25 is a minimax estimator of $\theta$ under the
loss function (4.37).


89. Let $X_1,...,X_k$ be independent random variables, where $X_i$ has the
binomial distribution $Bi(p_i,n_i)$ with an unknown $p_i \in (0,1)$ and a
known $n_i$. For estimating $\theta = (p_1,...,p_k)$ under the loss (4.37), find a
minimax estimator of $\theta$ and determine whether it is admissible.


90. Show that the risk function in (4.42) tends to $p$ as $\|\theta\| \to \infty$.

91. Suppose that $X$ is $N_p(\theta, I_p)$. Consider the estimation of $\theta$ under the
loss $(a - \theta)^\tau Q(a - \theta)$ with a positive definite $p \times p$ matrix $Q$. Show
that the risk of the estimator

$$X - \frac{r(p-2)}{\|Q^{-1/2}(X-c)\|^2}\,Q^{-1}(X-c)$$

is equal to

$$\mathrm{tr}(Q) - (2r - r^2)(p-2)^2 E\big(\|Q^{-1/2}(X-c)\|^{-2}\big).$$


92. Show that under the loss (4.37), the risk of $\delta_{c,r}$ in (4.45) is given by
(4.46).




93. Suppose that $X$ is $N_p(\theta,V)$ with $p \ge 4$. Consider the estimation of $\theta$
under the loss function (4.37).

(a) When $V = I_p$, show that the risk of the estimator in (4.48) is
$p - (p-3)^2 E(\|X - \bar{X}J_p\|^{-2})$.

(b) When $V = \sigma^2 D$ with an unknown $\sigma^2 > 0$ and a known matrix $D$,
show that the risk function of the estimator in (4.49) is smaller than
that of $X$ for any $\theta$ and $\sigma^2$.


94. Let $X$ be a sample from a p.d.f. $f_\theta$ and $T(X)$ be a sufficient statistic
for $\theta$. Show that if an MLE exists, it is a function of $T$ but it may
not be sufficient for $\theta$.


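95. Let $\{f_\theta: \theta \in \Theta\}$ be a family of p.d.f.'s w.r.t. a $\sigma$-finite measure, where
$\Theta \subset \mathcal{R}^k$; $h$ be a Borel function from $\Theta$ onto $\Lambda \subset \mathcal{R}^p$, $1 \le p \le k$; and
let $\tilde\ell(\lambda) = \sup_{\theta: h(\theta) = \lambda} \ell(\theta)$ be the induced likelihood function for the
transformed parameter $\lambda$. Show that if $\hat\theta \in \Theta$ is an MLE of $\theta$, then
$\hat\lambda = h(\hat\theta)$ maximizes $\tilde\ell(\lambda)$.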


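96. Let $X_1,...,X_n$ be i.i.d. with a p.d.f. $f_\theta$. Find an MLE of $\theta$ in each of
the following cases.

(a) $f_\theta(x) = \theta^{-1} I_{\{1,...,\theta\}}(x)$, $\theta$ is an integer between 1 and $\theta_0$.

(b) $f_\theta(x) = e^{-(x-\theta)} I_{(\theta,\infty)}(x)$, $\theta > 0$.

(c) $f_\theta(x) = \theta(1-x)^{\theta-1} I_{(0,1)}(x)$, $\theta > 1$.

(d) $f_\theta(x) = \frac{\theta}{1-\theta}\, x^{(2\theta-1)/(1-\theta)} I_{(0,1)}(x)$, $\theta \in (\frac{1}{2}, 1)$.

(e) $f_\theta(x) = 2^{-1} e^{-|x-\theta|}$, $\theta \in \mathcal{R}$.

(f) $f_\theta(x) = \theta x^{-2} I_{(\theta,\infty)}(x)$, $\theta > 0$.

(g) $f_\theta(x) = \theta^x (1-\theta)^{1-x} I_{\{0,1\}}(x)$, $\theta \in [\frac{1}{2}, \frac{3}{4}]$.

(h) $f_\theta(x)$ is the p.d.f. of $N(\theta, \theta^2)$, $\theta \in \mathcal{R}$, $\theta \ne 0$.

(i) $f_\theta(x)$ is the p.d.f. of the exponential distribution $E(\mu,\sigma)$,
$\theta = (\mu,\sigma) \in \mathcal{R} \times (0,\infty)$.

(j) $f_\theta(x)$ is the p.d.f. of the log-normal distribution $LN(\mu,\sigma^2)$,
$\theta = (\mu,\sigma^2) \in \mathcal{R} \times (0,\infty)$.

(k) $f_\theta(x) = I_{(0,1)}(x)$ if $\theta = 0$ and $f_\theta(x) = (2\sqrt{x})^{-1} I_{(0,1)}(x)$ if $\theta = 1$.

(l) $f_\theta(x) = \beta^{-\alpha}\alpha x^{\alpha-1} I_{(0,\beta)}(x)$, $\theta = (\alpha,\beta) \in (0,\infty) \times (0,\infty)$.

(m) $f_\theta(x) = \binom{\theta}{x} p^x (1-p)^{\theta-x} I_{\{0,1,...,\theta\}}(x)$, $\theta = 1,2,...$, where $p \in (0,1)$
is known.

(n) $f_\theta(x) = \frac{1}{2}(1 - \theta^2) e^{\theta x - |x|}$, $\theta \in (-1,1)$.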


97. In Exercise 14, obtain an MLE of $\theta$ when (a) $\theta \in \mathcal{R}$ and (b) $\theta \le 0$.


98. Suppose that $n$ observations are taken from $N(\mu,1)$ with an unknown
$\mu$. Instead of recording all the observations, one records only whether
the observation is less than 0. Find an MLE of $\mu$.


99. Find an MLE of $\theta$ in Exercise 43 of §2.6.




100. Let $(Y_1,Z_1),...,(Y_n,Z_n)$ be i.i.d. random 2-vectors such that $Y_i$ and
$Z_i$ are independently distributed as the exponential distributions
$E(0,\lambda)$ and $E(0,\mu)$, respectively, where $\lambda > 0$ and $\mu > 0$.

(a) Find the MLE of $(\lambda,\mu)$.

(b) Suppose that we only observe $X_i = \min\{Y_i, Z_i\}$ and $\Delta_i = 1$ if
$X_i = Y_i$ and $\Delta_i = 0$ if $X_i = Z_i$. Find the MLE of $(\lambda,\mu)$.


101. In Example 4.33, show that almost surely the likelihood equation has
a unique solution that is the MLE of $\theta = (\alpha,\gamma)$. Obtain iteration
equation (4.53) for this example. Discuss how to apply the Fisher-scoring
method in this example.


102. Let $X_1,...,X_n$ be i.i.d. from the discrete p.d.f. in Exercise 84 with an
unknown $\theta > 0$. Show that the likelihood equation has a unique root
when the sample mean $> 1$. Show whether this root is an MLE of $\theta$.


103. Let $X_1,...,X_n$ be i.i.d. from the logistic distribution $LG(\mu,\sigma)$ (Table
1.2, page 20).

(a) Show how to find an MLE of $\mu$ when $\mu \in \mathcal{R}$ and $\sigma$ is known.

(b) Show how to find an MLE of $\sigma$ when $\sigma > 0$ and $\mu$ is known.


104. Let $(X_1,Y_1),...,(X_n,Y_n)$ be i.i.d. from a two-dimensional normal
distribution with $E(X_1) = E(Y_1) = 0$, $\mathrm{Var}(X_1) = \mathrm{Var}(Y_1) = 1$, and an
unknown correlation coefficient $\rho \in (-1,1)$. Show that the likelihood
equation is a cubic in $\rho$ and the probability that it has a unique root
tends to 1 as $n \to \infty$.


105. Let $X_1,...,X_n$ be i.i.d. from the Weibull distribution $W(\alpha,\theta)$ (Table
1.2, page 20) with unknown $\alpha > 0$ and $\theta > 0$. Show that
the likelihood equation is equivalent to $h(\alpha) = n^{-1}\sum_{i=1}^n \log x_i$ and
$\theta = n^{-1}\sum_{i=1}^n x_i^\alpha$, where $h(\alpha) = (\sum_{i=1}^n x_i^\alpha)^{-1}\sum_{i=1}^n x_i^\alpha \log x_i - \alpha^{-1}$,
and that the likelihood equation has a unique solution.


106. Consider the random effects model in Example 3.17. Assume that
$\mu = 0$ and $n_i = n_0$ for all $i$. Provide a condition on $X_{ij}$'s under which
a unique MLE of $(\sigma_a^2,\sigma^2)$ exists and find this MLE.


107. Let $X_1,...,X_n$ be i.i.d. with the p.d.f. $\theta f(\theta x)$, where $f$ is a Lebesgue
p.d.f. on $(0,\infty)$ or symmetric about 0, and $\theta > 0$ is an unknown
parameter. Show that the likelihood equation has a unique root if
$xf'(x)/f(x)$ is continuous in $x$ and strictly decreasing for $x > 0$.
Verify that this condition is satisfied if $f$ is the p.d.f. of the Cauchy
distribution $C(0,1)$.


108. Let $X_1,...,X_n$ be i.i.d. with the Lebesgue p.d.f. $f_\theta(x) = \theta f_1(x) +
(1-\theta) f_2(x)$, where $f_j$'s are two different known Lebesgue p.d.f.'s and
$\theta \in (0,1)$ is unknown.

(a) Provide a necessary and sufficient condition for the likelihood
equation to have a unique solution and show that if there is a solution,
it is the MLE of $\theta$.

(b) Derive the MLE of $\theta$ when the likelihood equation has no solution.


109. Consider the location family in §4.2.1 and the scale family in §4.2.2.
In each case, show that an MLE or an RLE (root of the likelihood
equation) of the parameter, if it exists, is invariant.


110. Let $X$ be a sample from $P_\theta$, $\theta \in \mathcal{R}$. Suppose that $P_\theta$'s have p.d.f.'s
$f_\theta$ w.r.t. a common $\sigma$-finite measure and that $\{x: f_\theta(x) > 0\}$ does
not depend on $\theta$. Assume further that an estimator $\hat\theta$ of $\theta$ attains
the Cramér-Rao lower bound and that the conditions in Theorem 3.3
hold for $\hat\theta$. Show that $\hat\theta$ is a unique MLE of $\theta$.


111. Let $X_{ij}$, $j = 1,...,r > 1$, $i = 1,...,n$, be independently distributed as
$N(\mu_i,\sigma^2)$. Find the MLE of $(\mu_1,...,\mu_n,\sigma^2)$. Show that the MLE of
$\sigma^2$ is not a consistent estimator (as $n \to \infty$).


112. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution $U(0,\theta)$, where
$\theta > 0$ is unknown. Let $\hat\theta$ be the MLE of $\theta$ and $T$ be the UMVUE.

(a) Obtain the ratio $\mathrm{mse}_T(\theta)/\mathrm{mse}_{\hat\theta}(\theta)$ and show that the MLE is
inadmissible when $n \ge 2$.

(b) Let $Z_{a,\theta}$ be a random variable having the exponential distribution
$E(a,\theta)$. Prove $n(\theta - \hat\theta) \to_d Z_{0,\theta}$ and $n(\theta - T) \to_d Z_{-\theta,\theta}$. Obtain the
asymptotic relative efficiency of $\hat\theta$ w.r.t. $T$.


113. Let $X_1,...,X_n$ be i.i.d. from the exponential distribution $E(a,\theta)$ with
unknown $a$ and $\theta$. Obtain the asymptotic relative efficiency of the
MLE of $a$ (or $\theta$) w.r.t. the UMVUE of $a$ (or $\theta$).


114. Let $X_1,...,X_n$ be i.i.d. from the Pareto distribution $Pa(a,\theta)$ with
unknown $a$ and $\theta$.

(a) Find the MLE of $(a,\theta)$.

(b) Find the asymptotic relative efficiency of the MLE of $a$ w.r.t. the
UMVUE of $a$.


115. In Exercises 40 and 41 of §2.6,

(a) obtain an MLE of $(\theta,j)$;

(b) show whether the MLE of $j$ in part (a) is consistent;

(c) show that the MLE of $\theta$ is consistent and derive its nondegenerated
asymptotic distribution.


116. In Example 4.36, obtain the MLE of $\beta$ under the canonical link and
assumption (4.58) but $\phi_i \not\equiv 1$.




117. Consider the GLM in Example 4.35 with $\phi_i \equiv 1$ and the canonical
link. Assume that $\sum_{i=1}^n Z_i Z_i^\tau$ is positive definite for $n \ge n_0$. Show
that the likelihood equation has at most one solution when $n \ge n_0$
and a solution exists with probability tending to 1.


118. Consider the linear model (3.25) with $\varepsilon = N_n(0,V)$, where $V$ is an
unknown positive definite matrix. Show that the LSE $\hat\beta$ defined by
(3.29) is an MQLE and that $\hat\beta$ is an MLE if and only if one of (a)-(e)
in Theorem 3.10 holds.


119. Let $X_j$ be a random variable having the binomial distribution
$Bi(p_j,n_j)$ with a known $n_j$ and an unknown $p_j \in (0,1)$, $j = 1,2$.
Assume that $X_j$'s are independent. Obtain a conditional likelihood
function of the odds ratio $\theta = \frac{p_1}{1-p_1} \big/ \frac{p_2}{1-p_2}$, given $X_1 + X_2$.


120. Let $X_1$ and $X_2$ be independent from Poisson distributions $P(\mu_1)$ and
$P(\mu_2)$, respectively. Suppose that we are interested in $\theta_1 = \mu_1/\mu_2$.
Derive a conditional likelihood function of $\theta_1$, using (a) $\theta_2 = \mu_1$; (b)
$\theta_2 = \mu_1 + \mu_2$; and (c) $\theta_2 = \mu_1\mu_2$.


121. Assume model (4.66) with $p = 2$ and normally distributed i.i.d. $\varepsilon_t$'s.
Obtain the conditional likelihood given $(X_1,X_2) = (x_1,x_2)$.


122. Prove the claim in Example 4.38.


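123. Prove (4.70). (Hint: Show, using the argument in proving (4.77), that
$n^{-1}\big|\frac{\partial^2}{\partial\theta^2}\log\ell(\xi_n) - \frac{\partial^2}{\partial\theta^2}\log\ell(\theta)\big| = o_p(1)$ for any random variable $\xi_n$
satisfying $|\xi_n - \theta| \le |\hat\theta_n - \theta|$.)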


124. Let $X_1,...,X_n$ be i.i.d. from $N(\mu,1)$ truncated at two known points
$\alpha < \beta$, i.e., the Lebesgue p.d.f. of $X_1$ is

$$\{\sqrt{2\pi}[\Phi(\beta - \mu) - \Phi(\alpha - \mu)]\}^{-1} e^{-(x-\mu)^2/2} I_{(\alpha,\beta)}(x).$$

(a) Show that the sample mean $\bar{X}$ is asymptotically efficient for
estimating $\theta = EX_1$.

(b) Show that $\bar{X}$ is the unique MLE of $\theta$.


125. Let $X_1,...,X_n$ be i.i.d. from the discrete p.d.f.

$$f_\theta(x) = [1 - (1-\theta)^m]^{-1}\binom{m}{x}\theta^x(1-\theta)^{m-x} I_{\{1,2,...,m\}}(x),$$

where $\theta \in (0,1)$ is unknown and $m \ge 2$ is a known integer.

(a) When the sample mean $\bar{X} = m$, show that $\bar{X}/m$ is an MLE of $\theta$.

(b) When $1 < \bar{X} < m$, show that the likelihood equation has at least
one solution.

(c) Show that the regularity conditions of Theorem 4.16 are satisfied
and find the asymptotic variance of a consistent RLE of $\theta$.




126. In Exercise 96, check whether the regularity conditions of Theorem
4.16 are satisfied for cases (b), (c), (d), (e), (g), (h), (j) and (n).
Obtain nondegenerated asymptotic distributions of RLE's for cases
in which Theorem 4.17 can be applied.


127. Let $X_1,...,X_n$ be i.i.d. random variables such that $\log X_i$ is $N(\theta,\theta)$
with an unknown $\theta > 0$.

(a) Obtain the likelihood equation and show that one of the solutions
of the likelihood equation is the unique MLE of $\theta$.

(b) Using Theorem 4.17, obtain the asymptotic distribution of the
MLE of $\theta$.


128. In Exercise 107 of §3.6, find the MLE's of $\alpha$ and $\beta$ and obtain their
nondegenerated asymptotic joint distribution.


129. In Example 4.30, show that the MLE (or RLE) of $\theta$ is asymptotically
efficient by (a) applying Theorem 4.17 and (b) directly deriving the
asymptotic distribution of the MLE.


130. In Example 4.23, show that there is a unique asymptotically efficient
RLE of $\theta = (p_1,...,p_k)$. Discuss whether this RLE is the MLE.


131. Let $X_1,...,X_n$ be i.i.d. with $P(X_1 = 0) = 6\theta^2 - 4\theta + 1$, $P(X_1 = 1) =
\theta - 2\theta^2$, and $P(X_1 = 2) = 3\theta - 4\theta^2$, where $\theta \in (0,\frac{1}{2})$ is unknown.
Apply Theorem 4.17 to obtain the asymptotic distribution of an RLE
of $\theta$.


132. Let $X_1,...,X_n$ be i.i.d. random variables from $N(\mu,1)$, where $\mu \in \mathcal{R}$ is
unknown. Let $\theta = P(X_1 \le c)$, where $c$ is a known constant. Find the
asymptotic relative efficiency of the MLE of $\theta$ w.r.t. (a) the UMVUE
of $\theta$ and (b) the estimator $n^{-1}\sum_{i=1}^n I_{(-\infty,c]}(X_i)$.


133. In Exercise 19 of §3.6, find the MLE's of $\theta$ and $\vartheta = P(Y_1 > 1)$ and find
the asymptotic relative efficiency of the MLE of $\vartheta$ w.r.t. the UMVUE
of $\vartheta$ in part (b).


134. Let $(X_1,Y_1),...,(X_n,Y_n)$ be i.i.d. random 2-vectors. Suppose that
both $X_1$ and $Y_1$ are binary, $P(X_1 = 1) = \frac{1}{2}$, $P(Y_1 = 1|X_1 = 1) =
e^{-a\theta}$, and $P(Y_1 = 1|X_1 = 0) = e^{-b\theta}$, where $\theta > 0$ is unknown and
$a > 0$ and $b > 0$ are known constants.

(a) Suppose that $(X_i,Y_i)$, $i = 1,...,n$, are observed. Find the MLE
of $\theta$ and its nondegenerated asymptotic distribution.

(b) Suppose that only $Y_1,...,Y_n$ are observed. Find the MLE of $\theta$ and
its nondegenerated asymptotic distribution.

(c) Calculate the asymptotic relative efficiency of the MLE in (a)
w.r.t. the MLE in (b). How much efficiency is lost in the special case
of $a = b$?




135. In Exercise 110 of §3.6, derive

(a) the MLE of $(\theta_1,\theta_2)$;

(b) a nondegenerated asymptotic distribution of the MLE of $(\theta_1,\theta_2)$;

(c) the asymptotic relative efficiencies of the MLE's w.r.t. the moment
estimators in Exercise 110 of §3.6.


136. In Exercise 104, show that the RLE $\hat\rho$ of $\rho$ is asymptotically distributed
as $N(\rho, (1-\rho^2)^2/[n(1+\rho^2)])$.


137. In Exercise 107, obtain a nondegenerated asymptotic distribution of
the RLE of $\theta$ when $f$ is the p.d.f. of the Cauchy distribution $C(0,1)$.


138. Let $X_1,...,X_n$ be i.i.d. from the logistic distribution $LG(\mu,\sigma)$ with
unknown $\mu \in \mathcal{R}$ and $\sigma > 0$. Obtain a nondegenerated asymptotic
distribution of the RLE of $(\mu,\sigma)$.


139. In Exercise 105, show that the conditions of Theorem 4.16 are satisfied.


140. Let $X_1,...,X_n$ be i.i.d. binary random variables with $P(X_i = 1) = p$,
where $p \in (0,1)$ is unknown. Let $\hat\vartheta_n$ be the MLE of $\vartheta = p(1-p)$.

(a) Show that $\hat\vartheta_n$ is asymptotically normal when $p \ne \frac{1}{2}$.

(b) When $p = \frac{1}{2}$, derive a nondegenerated asymptotic distribution of
$\hat\vartheta_n$ with an appropriate normalization.


141. Let $(X_1,Y_1),...,(X_n,Y_n)$ be i.i.d. random 2-vectors satisfying
$0 \le X_1 \le 1$, $0 \le Y_1 \le 1$, and

$$P(X_1 > x, Y_1 > y) = (1-x)(1-y)(1 - \max\{x,y\})^\theta$$

for $0 \le x \le 1$, $0 \le y \le 1$, where $\theta \ge 0$ is unknown.

(a) Obtain the likelihood function and the likelihood equation.

(b) Show that an RLE of $\theta$ is asymptotically normal and derive its
amse.


142. Assume the conditions in Theorem 4.16. Suppose that $\theta = (\theta_1,...,\theta_k)$
and there is a positive integer $p < k$ such that $\partial\log\ell(\theta)/\partial\theta_i$ and
$\partial\log\ell(\theta)/\partial\theta_j$ are uncorrelated whenever $i \le p < j$. Show that the
asymptotic distribution of the RLE of $(\theta_1,...,\theta_p)$ is unaffected by
whether $\theta_{p+1},...,\theta_k$ are known.


143. Let $X_1,...,X_n$ be i.i.d. random $p$-vectors from $N_p(\mu,\Sigma)$ with unknown
$\mu$ and $\Sigma$. Find the MLE's of $\mu$ and $\Sigma$ and derive their nondegenerated
asymptotic distributions.


144. Let $X_1,...,X_n$ be i.i.d. bivariate normal random vectors with mean
0 and an unknown covariance matrix whose diagonal elements are
$\sigma_1^2$ and $\sigma_2^2$ and off-diagonal element is $\sigma_1\sigma_2\rho$. Let $\theta = (\sigma_1^2,\sigma_2^2,\rho)$.
Obtain $I_n(\theta)$ and $[I_n(\theta)]^{-1}$ and derive a nondegenerated asymptotic
distribution of the MLE of $\theta$.


145. Let $X_1,...,X_n$ be i.i.d. each with probability $p$ as $N(\mu,\sigma^2)$ and
probability $1 - p$ as $N(\eta,\tau^2)$, where $\theta = (\mu,\eta,\sigma^2,\tau^2,p)$ is unknown.

(a) Show that the conditions in Theorem 4.16 are satisfied.

(b) Show that the likelihood function is unbounded.

(c) Show that an MLE may be inconsistent.


146. Let $X_1,...,X_n$ and $Y_1,...,Y_n$ be independently distributed as $N(\mu,\sigma^2)$
and $N(\mu,\tau^2)$, respectively, with unknown $\theta = (\mu,\sigma^2,\tau^2)$. Find the
MLE of $\theta$ and show that it is asymptotically efficient.


147. Find a nondegenerated asymptotic distribution of the MLE of $(\sigma_a^2,\sigma^2)$
in Exercise 106.


148. Under the conditions in Theorem 4.18, prove (4.85) and (4.86).


149. Assume linear model (3.25) with $\varepsilon = N_n(0,\sigma^2 I_n)$ and a full rank
$Z$. Apply Theorem 4.18 to show that the LSE $\hat\beta$ is asymptotically
efficient. Compare this result with that in Theorem 3.12.


150. Apply Theorem 4.18 to obtain the asymptotic distribution of the RLE
of $\beta$ in (a) Example 4.35 and (b) Example 4.37.


151. Let $X_1,...,X_n$ be i.i.d. from the logistic distribution $LG(\mu,\sigma)$, $\mu \in \mathcal{R}$,
$\sigma > 0$. Using Newton-Raphson and Fisher-scoring methods, find

(a) one-step MLE's of $\mu$ when $\sigma$ is known;

(b) one-step MLE's of $\sigma$ when $\mu$ is known;

(c) one-step MLE's of $(\mu,\sigma)$;

(d) $\sqrt{n}$-consistent initial estimators in (a)-(c).


152. Under the GLM (4.55)-(4.58),

(a) show how to obtain a one-step MLE of $\beta$, if an initial estimator
$\hat\beta_n^{(0)}$ is available;

(b) show that under the conditions in Theorem 4.18, the one-step
MLE satisfies (4.81) if $\|[I_n(\beta)]^{1/2}(\hat\beta_n^{(0)} - \beta)\| = O_p(1)$.


153. In Example 4.40, show that the conditions in Theorem 4.20 concerning
the likelihood function are satisfied.


154. Let $X_1,...,X_n$ be i.i.d. from the logistic distribution $LG(\mu,\sigma)$ with
unknown $\mu \in \mathcal{R}$ and $\sigma > 0$. Show that the conditions in Theorem
4.20 concerning the likelihood function are satisfied.


Chapter 5 


Estimation in 
Nonparametric Models 


Estimation methods studied in this chapter are useful for nonparametric 
models as well as for parametric models in which the parametric model 
assumptions might be violated (so that robust estimators are required) 
or the number of unknown parameters is exceptionally large. Some such 
methods have been introduced in Chapter 3; for example, the methods 
that produce UMVUE’s in nonparametric models, the U- and V-statistics, 
the LSE’s and BLUE’s, the Horvitz-Thompson estimators, and the sample 
(central) moments. 


The theoretical justification for estimators in nonparametric models, 
however, relies more on asymptotics than that in parametric models. This 
means that applications of nonparametric methods usually require large 
sample sizes. Also, estimators derived using parametric methods are asymp- 
totically more efficient than those based on nonparametric methods when 
the parametric models are correct. Thus, to choose between a parametric 
method and a nonparametric method, we need to balance the advantage of 
requiring weaker model assumptions (robustness) against the drawback of 
losing efficiency, which results in requiring a larger sample size. 

It is assumed in this chapter that a sample $X = (X_1,...,X_n)$ is from a
population in a nonparametric family, where $X_i$'s are random vectors.


5.1 Distribution Estimators 


In many applications the c.d.f.'s of $X_i$'s are determined by a single c.d.f.
$F$ on $\mathcal{R}^d$; for example, $X_i$'s are i.i.d. random $d$-vectors. In this section, we
consider the estimation of $F$ or $F(t)$ for several $t$'s, under a nonparametric
model in which very little is assumed about $F$.


5.1.1 Empirical c.d.f.’s in i.i.d. cases 


For i.i.d. random variables $X_1,...,X_n$, the empirical c.d.f. $F_n$ is defined in
(2.28). The definition of the empirical c.d.f. based on $X = (X_1,...,X_n)$ in
the case of $X_i \in \mathcal{R}^d$ is analogously given by

$$F_n(t) = \frac{1}{n}\sum_{i=1}^n I_{(-\infty,t]}(X_i), \qquad t \in \mathcal{R}^d, \tag{5.1}$$

where $(-\infty,a]$ denotes the set $(-\infty,a_1] \times \cdots \times (-\infty,a_d]$ for any
$a = (a_1,...,a_d) \in \mathcal{R}^d$. Similar to the case of $d = 1$ (Example 2.26), $F_n(t)$ as
an estimator of $F(t)$ has the following properties. For any $t \in \mathcal{R}^d$, $nF_n(t)$
has the binomial distribution $Bi(F(t),n)$; $F_n(t)$ is unbiased with variance
$F(t)[1 - F(t)]/n$; $F_n(t)$ is the UMVUE under some nonparametric models;
and $F_n(t)$ is $\sqrt{n}$-consistent for $F(t)$. For any $m$ fixed distinct points
$t_1,...,t_m$ in $\mathcal{R}^d$, it follows from the multivariate CLT (Corollary 1.2) and
(5.1) that as $n \to \infty$,

$$\sqrt{n}\big[(F_n(t_1),...,F_n(t_m)) - (F(t_1),...,F(t_m))\big] \to_d N_m(0,\Sigma), \tag{5.2}$$

where $\Sigma$ is the $m \times m$ matrix whose $(i,j)$th element is

$$P\big(X_1 \in (-\infty,t_i] \cap (-\infty,t_j]\big) - F(t_i)F(t_j).$$

Note that these results hold without any assumption on $F$.
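As a small illustration of definition (5.1), the following sketch evaluates the empirical c.d.f. at a point $t \in \mathcal{R}^d$ by counting the sample points that are componentwise no larger than $t$. The function name `empirical_cdf` and the four-point sample are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def empirical_cdf(sample: Sequence[Sequence[float]], t: Sequence[float]) -> float:
    """F_n(t) as in (5.1): fraction of X_i lying in (-inf, t_1] x ... x (-inf, t_d]."""
    n = len(sample)
    # X_i is counted when every component of X_i is <= the matching component of t
    hits = sum(1 for x in sample if all(xk <= tk for xk, tk in zip(x, t)))
    return hits / n


# Illustrative (made-up) sample of n = 4 points in R^2
sample = [(0.2, 0.9), (0.5, 0.1), (0.7, 0.6), (0.4, 0.4)]
print(empirical_cdf(sample, (0.5, 0.5)))  # (0.5, 0.1) and (0.4, 0.4) qualify -> 0.5
```

Since $nF_n(t)$ is a count of indicator variables, its binomial distribution $Bi(F(t),n)$ can be read off directly from this counting form.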


Considered as a function of $t$, $F_n$ is a random element taking values in
$\mathcal{F}$, the collection of all c.d.f.'s on $\mathcal{R}^d$. As $n \to \infty$, $\sqrt{n}(F_n - F)$ converges
in some sense to a random element defined on some probability space. A
detailed discussion of such a result is beyond our scope and can be found, for
example, in Shorack and Wellner (1986). To discuss some global properties
of $F_n$ as an estimator of $F \in \mathcal{F}$, we need to define a closeness measure
between the elements (c.d.f.'s) in $\mathcal{F}$.


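Definition 5.1. Let $\mathcal{F}_0$ be a collection of c.d.f.'s on $\mathcal{R}^d$.

(i) A function $\varrho$ from $\mathcal{F}_0 \times \mathcal{F}_0$ to $[0,\infty)$ is called a distance or metric on $\mathcal{F}_0$
if and only if for any $G_j$ in $\mathcal{F}_0$, (a) $\varrho(G_1,G_2) = 0$ if and only if $G_1 = G_2$;
(b) $\varrho(G_1,G_2) = \varrho(G_2,G_1)$; and (c) $\varrho(G_1,G_2) \le \varrho(G_1,G_3) + \varrho(G_3,G_2)$.

(ii) Let $\mathcal{D} = \{c(G_1 - G_2): c \in \mathcal{R}, G_j \in \mathcal{F}_0, j = 1,2\}$. A function $\|\cdot\|$
from $\mathcal{D}$ to $[0,\infty)$ is called a norm on $\mathcal{D}$ if and only if (a) $\|\Delta\| = 0$ if and
only if $\Delta = 0$; (b) $\|c\Delta\| = |c|\,\|\Delta\|$ for any $\Delta \in \mathcal{D}$ and $c \in \mathcal{R}$; and (c)
$\|\Delta_1 + \Delta_2\| \le \|\Delta_1\| + \|\Delta_2\|$ for any $\Delta_j \in \mathcal{D}$, $j = 1,2$. ■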




Any norm $\|\cdot\|$ on $\mathcal{D}$ induces a distance given by $\varrho(G_1,G_2) = \|G_1 - G_2\|$.
The most commonly used distance is the sup-norm distance $\varrho_\infty$, i.e., the
distance induced by the sup-norm

$$\|G_1 - G_2\|_\infty = \sup_{t \in \mathcal{R}^d} |G_1(t) - G_2(t)|, \qquad G_j \in \mathcal{F}. \tag{5.3}$$

The following result concerning the sup-norm distance between $F_n$ and $F$
is due to Dvoretzky, Kiefer, and Wolfowitz (1956).


Lemma 5.1. (DKW's inequality). Let $F_n$ be the empirical c.d.f. based on
i.i.d. $X_1,...,X_n$ from a c.d.f. $F$ on $\mathcal{R}^d$.

(i) When $d = 1$, there exists a positive constant $C$ (not depending on $F$)
such that

$$P\big(\varrho_\infty(F_n,F) > z\big) \le Ce^{-2nz^2}, \qquad z > 0,\ n = 1,2,....$$

(ii) When $d \ge 2$, for any $\epsilon > 0$, there exists a positive constant $C_{\epsilon,d}$ (not
depending on $F$) such that

$$P\big(\varrho_\infty(F_n,F) > z\big) \le C_{\epsilon,d}\,e^{-(2-\epsilon)nz^2}, \qquad z > 0,\ n = 1,2,.... \quad ■$$


The proof of this lemma is omitted. The following results useful in 
statistics are direct consequences of Lemma 5.1. 


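Theorem 5.1. Let $F_n$ be the empirical c.d.f. based on i.i.d. $X_1,...,X_n$
from a c.d.f. $F$ on $\mathcal{R}^d$. Then

(i) $\varrho_\infty(F_n,F) \to_{a.s.} 0$ as $n \to \infty$;

(ii) $E[\sqrt{n}\,\varrho_\infty(F_n,F)]^s = O(1)$ for any $s > 0$.

Proof. (i) From DKW's inequality, for any $z > 0$,

$$\sum_{n=1}^\infty P\big(\varrho_\infty(F_n,F) > z\big) < \infty.$$

Hence, the result follows from Theorem 1.8(v).

(ii) Using DKW's inequality with $z = y^{1/s}/\sqrt{n}$ and the result in Exercise
55 of §1.6, we obtain that

$$E[\sqrt{n}\,\varrho_\infty(F_n,F)]^s = \int_0^\infty P\big(\sqrt{n}\,\varrho_\infty(F_n,F) > y^{1/s}\big)\,dy \le C_{\epsilon,d}\int_0^\infty e^{-(2-\epsilon)y^{2/s}}\,dy = O(1)$$

as long as $2 - \epsilon > 0$. ■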




Theorem 5.1(i) means that $F_n(t) \to_{a.s.} F(t)$ uniformly in $t \in \mathcal{R}^d$, a
result stronger than the strong consistency of $F_n(t)$ for every $t$. Theorem
5.1(ii) implies that $\sqrt{n}\,\varrho_\infty(F_n,F) = O_p(1)$, a result stronger than the
$\sqrt{n}$-consistency of $F_n(t)$. These results hold without any condition on $F$.
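The $O_p(1)$ behaviour of $\sqrt{n}\,\varrho_\infty(F_n,F)$ can be observed numerically. The sketch below is a minimal illustration for $d = 1$ only, under the added assumption that $F$ is continuous, in which case the supremum in (5.3) is attained at a jump of $F_n$ and can be computed from the order statistics. The helper names `sup_distance` and `uniform_cdf`, the choice $F = U(0,1)$, and the seed are all illustrative assumptions, not part of the text.

```python
import math
import random


def sup_distance(sample, cdf):
    """sup_t |F_n(t) - F(t)| for d = 1, assuming the c.d.f. F is continuous."""
    xs = sorted(sample)
    n = len(xs)
    dist = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # F_n jumps from (i-1)/n to i/n at the i-th order statistic,
        # so the largest gap near x is one of these two differences.
        dist = max(dist, i / n - fx, fx - (i - 1) / n)
    return dist


def uniform_cdf(t):
    """c.d.f. of U(0,1), used here only as an example of a continuous F."""
    return min(1.0, max(0.0, t))


random.seed(0)
for n in (100, 1000, 10000):
    d = sup_distance([random.random() for _ in range(n)], uniform_cdf)
    # The distance shrinks with n while sqrt(n) * distance stays of order 1.
    print(n, round(d, 4), round(math.sqrt(n) * d, 3))
```

The first printed column of distances should decrease as $n$ grows, while the rescaled column remains bounded, in line with Theorem 5.1.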

Let $p \ge 1$ and $\mathcal{F}_p = \{G \in \mathcal{F}: \int \|t\|^p dG < \infty\}$, which is the subset of
c.d.f.'s in $\mathcal{F}$ having finite $p$th moments. Mallows' distance between $G_1$ and
$G_2$ in $\mathcal{F}_p$ is defined to be

$$\varrho_{M_p}(G_1,G_2) = \inf\,(E\|Y_1 - Y_2\|^p)^{1/p}, \tag{5.4}$$

where the infimum is taken over all pairs of $Y_1$ and $Y_2$ having c.d.f.'s $G_1$ and
$G_2$, respectively. Let $\{G_j: j = 0,1,2,...\} \subset \mathcal{F}_p$. Then $\varrho_{M_p}(G_j,G_0) \to 0$ as
$j \to \infty$ if and only if $\int \|t\|^p dG_j \to \int \|t\|^p dG_0$ and $G_j(t) \to G_0(t)$ for every
$t \in \mathcal{R}^d$ at which $G_0$ is continuous. It follows from Theorem 5.1 and the
SLLN (Theorem 1.13) that $\varrho_{M_p}(F_n,F) \to_{a.s.} 0$ if $F \in \mathcal{F}_p$.
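To make (5.4) concrete, the following sketch computes Mallows' distance between two empirical distributions on the real line that have the same number of atoms. It relies on a standard fact that is not stated in the text and is used here as an assumption: when $d = 1$ and $p \ge 1$, the infimum in (5.4) is attained by pairing the sorted values (the quantile coupling). The function name `mallows_distance_1d` is hypothetical.

```python
def mallows_distance_1d(a, b, p=2.0):
    """Mallows' distance (5.4) between the empirical c.d.f.'s of two equal-size
    real samples, assuming the sorted (quantile) pairing attains the infimum."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles samples of equal size")
    xs, ys = sorted(a), sorted(b)
    # Average p-th power of the gaps between matched order statistics
    mean_gap = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return mean_gap ** (1.0 / p)


print(mallows_distance_1d([0.0, 1.0], [1.0, 2.0]))            # shift by 1 -> 1.0
print(mallows_distance_1d([3.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # same atoms -> 0.0
```

A pure shift of a distribution by a constant gives a distance equal to the size of the shift, which matches the intuition that (5.4) measures how far mass must be moved.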


When $d = 1$, another useful distance for measuring the closeness between
$F_n$ and $F$ is the $L_p$ distance $\varrho_{L_p}$ induced by the $L_p$-norm ($p \ge 1$)

$$\|G_1 - G_2\|_{L_p} = \left\{\int |G_1(t) - G_2(t)|^p\,dt\right\}^{1/p}, \qquad G_j \in \mathcal{F}_1. \tag{5.5}$$

A result similar to Theorem 5.1 is given as follows.


Theorem 5.2. Let $F_n$ be the empirical c.d.f. based on i.i.d. random variables
$X_1,...,X_n$ from a c.d.f. $F \in \mathcal{F}_1$. Then

(i) $\varrho_{L_p}(F_n,F) \to_{a.s.} 0$;

(ii) $E[\sqrt{n}\,\varrho_{L_p}(F_n,F)] = O(1)$ if $1 \le p < 2$ and $\int\{F(t)[1 - F(t)]\}^{p/2}dt < \infty$,
or $p \ge 2$.

Proof. (i) Since $[\varrho_{L_p}(F_n,F)]^p \le [\varrho_\infty(F_n,F)]^{p-1}[\varrho_{L_1}(F_n,F)]$ and, by Theorem
5.1, $\varrho_\infty(F_n,F) \to_{a.s.} 0$, it suffices to show the result for $p = 1$. Let
$Y_i = \int_{-\infty}^0 [I_{(-\infty,t]}(X_i) - F(t)]\,dt$. Then $Y_1,...,Y_n$ are i.i.d. and

$$E|Y_i| \le \int_{-\infty}^0 E|I_{(-\infty,t]}(X_i) - F(t)|\,dt = 2\int_{-\infty}^0 F(t)[1 - F(t)]\,dt,$$

which is finite under the condition that $F \in \mathcal{F}_1$. By the SLLN,

$$\int_{-\infty}^0 [F_n(t) - F(t)]\,dt = \frac{1}{n}\sum_{i=1}^n Y_i \to_{a.s.} E(Y_1) = 0. \tag{5.6}$$

Since $[F_n(t) - F(t)]_- \le F(t)$ and $\int_{-\infty}^0 F(t)\,dt < \infty$ (Exercise 55 in §1.6),
it follows from Theorem 5.1 and the dominated convergence theorem that
$\int_{-\infty}^0 [F_n(t) - F(t)]_-\,dt \to_{a.s.} 0$, which with (5.6) implies

$$\int_{-\infty}^0 |F_n(t) - F(t)|\,dt \to_{a.s.} 0. \tag{5.7}$$

The result follows since we can similarly show that (5.7) holds with $\int_{-\infty}^0$
replaced by $\int_0^\infty$.
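(ii) When $1 \le p < 2$, the result follows from

$$E[\varrho_{L_p}(F_n,F)] \le \left\{\int E|F_n(t) - F(t)|^p\,dt\right\}^{1/p} \le \left\{\int \big[E|F_n(t) - F(t)|^2\big]^{p/2}dt\right\}^{1/p} = \left\{n^{-p/2}\int \{F(t)[1 - F(t)]\}^{p/2}dt\right\}^{1/p} = O(n^{-1/2}),$$

where the two inequalities follow from Jensen's inequality. When $p \ge 2$,

$$E[\varrho_{L_p}(F_n,F)] \le E\left\{[\varrho_\infty(F_n,F)]^{1-2/p}[\varrho_{L_2}(F_n,F)]^{2/p}\right\}$$
$$\le \left\{E[\varrho_\infty(F_n,F)]^{(1-2/p)q}\right\}^{1/q}\left\{E[\varrho_{L_2}(F_n,F)]^2\right\}^{1/p}$$
$$= \left\{O(n^{-(1-2/p)q/2})\right\}^{1/q}\left\{E\int |F_n(t) - F(t)|^2\,dt\right\}^{1/p}$$
$$= O(n^{-(1-2/p)/2})\left\{\frac{1}{n}\int F(t)[1 - F(t)]\,dt\right\}^{1/p}$$
$$= O(n^{-1/2}),$$

where $\frac{1}{q} + \frac{1}{p} = 1$, the second inequality follows from Hölder's inequality (see
(1.40) in §1.3.2), and the first equality follows from Theorem 5.1(ii). ■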


5.1.2 Empirical likelihoods 


In §4.4 and §4.5, we have shown that the method of using likelihoods pro- 
vides some asymptotically efficient estimators. We now introduce some 
likelihoods in nonparametric models. This not only provides another justi- 
fication for the use of the empirical c.d.f. in (5.1), but also leads to a useful 
method of deriving estimators in various (possibly non-i.i.d.) cases, some 
of which are discussed later in this chapter. 


Let $X_1,...,X_n$ be i.i.d. with $F \in \mathcal{F}$ and $P_G$ be the probability measure
corresponding to $G \in \mathcal{F}$. Given $X_1 = x_1,...,X_n = x_n$, the nonparametric
likelihood function is defined to be the following functional from $\mathcal{F}$ to $[0,\infty)$:

$$\ell(G) = \prod_{i=1}^n P_G(\{x_i\}), \qquad G \in \mathcal{F}. \tag{5.8}$$




Apparently, $\ell(G) = 0$ if $P_G(\{x_i\}) = 0$ for at least one $i$. The following
result, due to Kiefer and Wolfowitz (1956), shows that the empirical c.d.f.
$F_n$ is a nonparametric maximum likelihood estimator of $F$.


Theorem 5.3. Let $X_1,...,X_n$ be i.i.d. with $F \in \mathcal{F}$ and $\ell(G)$ be defined by
(5.8). Then $F_n$ maximizes $\ell(G)$ over $G \in \mathcal{F}$.

Proof. We only need to consider $G \in \mathcal{F}$ such that $\ell(G) > 0$. Let $c \in (0,1]$
and $\mathcal{F}(c)$ be the subset of $\mathcal{F}$ containing $G$'s satisfying $p_i = P_G(\{x_i\}) > 0$,
$i = 1,...,n$, and $\sum_{i=1}^n p_i = c$. We now apply the Lagrange multiplier
method to solve the problem of maximizing $\ell(G)$ over $G \in \mathcal{F}(c)$. Define

$$H(p_1,...,p_n,\lambda) = \prod_{i=1}^n p_i + \lambda\left(\sum_{i=1}^n p_i - c\right),$$

where $\lambda$ is the Lagrange multiplier. Set

$$\frac{\partial H}{\partial\lambda} = \sum_{i=1}^n p_i - c = 0, \qquad \frac{\partial H}{\partial p_j} = p_j^{-1}\prod_{i=1}^n p_i + \lambda = 0, \quad j = 1,...,n.$$

The solution is $p_i = c/n$, $i = 1,...,n$, $\lambda = -(c/n)^{n-1}$. It can be shown
(exercise) that this solution is a maximum of $H(p_1,...,p_n,\lambda)$ over $p_i > 0$,
$i = 1,...,n$, $\sum_{i=1}^n p_i = c$. This shows that

$$\max_{G \in \mathcal{F}(c)} \ell(G) = (c/n)^n,$$

which is maximized at $c = 1$ for any fixed $n$. The result follows from
$P_{F_n}(\{x_i\}) = n^{-1}$ for given $X_i = x_i$, $i = 1,...,n$. ■
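The maximization in the proof can be checked numerically. The sketch below is an illustration, not a proof: it compares the log of the likelihood (5.8) at the uniform weights $p_i = 1/n$ with its value at randomly drawn weight vectors satisfying $p_i > 0$ and $\sum p_i = 1$. The function name `log_likelihood`, the choice $n = 5$, and the seed are illustrative assumptions.

```python
import math
import random


def log_likelihood(p):
    """log of ell(G) in (5.8), written in terms of p_i = P_G({x_i}) > 0."""
    return sum(math.log(pi) for pi in p)


n = 5
uniform = [1.0 / n] * n
best = log_likelihood(uniform)  # equals n * log(1/n), i.e. log of (1/n)^n

random.seed(1)
for _ in range(1000):
    raw = [random.random() + 1e-12 for _ in range(n)]  # strictly positive
    total = sum(raw)
    p = [r / total for r in raw]  # random point with p_i > 0 and sum p_i = 1
    # No feasible point should beat the uniform weights (up to rounding).
    assert log_likelihood(p) <= best + 1e-12

print(best, n * math.log(1.0 / n))
```

Working on the log scale avoids underflow of the product for larger $n$ and leaves the maximizer unchanged.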


From the proof of Theorem 5.3, $F_n$ maximizes the likelihood $\ell(G)$ in
(5.8) over $p_i > 0$, $i = 1,...,n$, and $\sum_{i=1}^n p_i = 1$, where $p_i = P_G(\{x_i\})$. This
method of deriving an estimator of $F$ can be extended to various situations
with some modifications of (5.8) and/or constraints on $p_i$'s. Modifications
of the likelihood in (5.8) are called empirical likelihoods (Owen, 1988, 2001;
Qin and Lawless, 1994). An estimator obtained by maximizing an empirical
likelihood is then called a maximum empirical likelihood estimator (MELE).
We now discuss several applications of the method of empirical likelihoods.


Consider first the estimation of $F$ with auxiliary information about $F$
(and i.i.d. $X_1,...,X_n$). For instance, suppose that there is a known Borel
function $u$ from $\mathcal{R}^d$ to $\mathcal{R}^s$ such that

$$\int u(x)\,dF = 0 \tag{5.9}$$

(e.g., some components of the mean of $F$ are 0). It is then reasonable to
expect that any estimate $\hat{F}$ of $F$ has property (5.9), i.e., $\int u(x)\,d\hat{F} = 0$,
which is not true for the empirical c.d.f. $F_n$ in (5.1), since

$$\int u(x)\,dF_n = \frac{1}{n}\sum_{i=1}^n u(X_i) \ne 0$$

even if $E[u(X_1)] = 0$. Using the method of empirical likelihoods, a natural
solution is to put another constraint in the process of maximizing the
likelihood. That is, we maximize $\ell(G)$ in (5.8) subject to

$$p_i > 0,\ i = 1,...,n, \qquad \sum_{i=1}^n p_i = 1, \qquad \text{and} \qquad \sum_{i=1}^n p_i u(x_i) = 0, \tag{5.10}$$

where $p_i = P_G(\{x_i\})$. Using the Lagrange multiplier method and an argument
similar to the proof of Theorem 5.3, it can be shown (exercise) that
an MELE of $F$ is

$$\hat{F}(t) = \sum_{i=1}^n \hat{p}_i I_{(-\infty,t]}(X_i), \tag{5.11}$$

where the notation $(-\infty,t]$ is the same as that in (5.1),

$$\hat{p}_i = n^{-1}[1 + \lambda_n^\tau u(X_i)]^{-1}, \qquad i = 1,...,n, \tag{5.12}$$

and $\lambda_n \in \mathcal{R}^s$ is the Lagrange multiplier satisfying

$$\sum_{i=1}^n \hat{p}_i u(X_i) = \frac{1}{n}\sum_{i=1}^n \frac{u(X_i)}{1 + \lambda_n^\tau u(X_i)} = 0. \tag{5.13}$$

Note that $\hat{F}$ reduces to $F_n$ if $u \equiv 0$.
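For a scalar constraint function ($s = 1$), equation (5.13) can be solved by a simple root search. The sketch below is a minimal illustration under stated assumptions: $u$ is real-valued, the observed values $u(X_i)$ take both signs (so that a root exists), and bisection is used in place of any particular algorithm from the text. The function name `el_weights` and the five data values are hypothetical.

```python
def el_weights(u, iterations=200):
    """Solve (5.13) for scalar u_i = u(X_i) by bisection and return
    (lambda_n, weights p_i) with p_i as in (5.12)."""
    n = len(u)
    if not (min(u) < 0.0 < max(u)):
        raise ValueError("0 must lie strictly between min(u) and max(u)")

    def g(lam):
        # Left-hand side of (5.13) times n; strictly decreasing in lam
        return sum(ui / (1.0 + lam * ui) for ui in u)

    # All 1 + lam * u_i are positive exactly on this open interval.
    lo, hi = -1.0 / max(u), -1.0 / min(u)
    eps = 1e-10 * (hi - lo)
    a, b = lo + eps, hi - eps  # g(a) > 0 > g(b)
    for _ in range(iterations):
        mid = 0.5 * (a + b)
        if g(mid) > 0.0:
            a = mid
        else:
            b = mid
    lam = 0.5 * (a + b)
    return lam, [1.0 / (n * (1.0 + lam * ui)) for ui in u]


# Illustrative data with u(x) = x, i.e. the information "the mean of F is 0"
x = [1.2, -0.3, 0.8, -1.1, 0.4]
lam, p = el_weights(x)
print(lam, sum(p), sum(pi * xi for pi, xi in zip(p, x)))
```

Because the sample mean of `x` is positive, the solution has $\lambda_n > 0$, so observations with larger $u(X_i)$ receive smaller weight; the weights automatically sum to 1 once (5.13) holds.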
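To see that (5.13) has a solution asymptotically, note that

$$\frac{\partial}{\partial\lambda}\left[\frac{1}{n}\sum_{i=1}^n \log\big(1 + \lambda^\tau u(X_i)\big)\right] = \frac{1}{n}\sum_{i=1}^n \frac{u(X_i)}{1 + \lambda^\tau u(X_i)}$$

and

$$\frac{\partial^2}{\partial\lambda\,\partial\lambda^\tau}\left[\frac{1}{n}\sum_{i=1}^n \log\big(1 + \lambda^\tau u(X_i)\big)\right] = -\frac{1}{n}\sum_{i=1}^n \frac{u(X_i)[u(X_i)]^\tau}{[1 + \lambda^\tau u(X_i)]^2},$$

which is negative definite if $\mathrm{Var}(u(X_1))$ is positive definite. Also,

$$E\left\{\frac{\partial}{\partial\lambda}\left[\frac{1}{n}\sum_{i=1}^n \log\big(1 + \lambda^\tau u(X_i)\big)\right]\bigg|_{\lambda=0}\right\} = E[u(X_1)] = 0.$$

Hence, using the same argument as in the proof of Theorem 4.18, we can
show that there exists a unique sequence $\{\lambda_n(X)\}$ such that as $n \to \infty$,

$$P\left(\frac{1}{n}\sum_{i=1}^n \frac{u(X_i)}{1 + \lambda_n^\tau u(X_i)} = 0\right) \to 1 \qquad \text{and} \qquad \lambda_n \to_p 0. \tag{5.14}$$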



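Theorem 5.4. Let $X_1,...,X_n$ be i.i.d. with $F \in \mathcal{F}$, $u$ be a Borel function
on $\mathcal{R}^d$ satisfying (5.9), and $\hat{F}$ be given by (5.11)-(5.13). Suppose that
$U = \mathrm{Var}(u(X_1))$ is positive definite. Then, for any $m$ fixed distinct $t_1,...,t_m$
in $\mathcal{R}^d$,

$$\sqrt{n}\big[(\hat{F}(t_1),...,\hat{F}(t_m)) - (F(t_1),...,F(t_m))\big] \to_d N_m(0,\Sigma_u), \tag{5.15}$$

where

$$\Sigma_u = \Sigma - W^\tau U^{-1} W,$$

$\Sigma$ is given in (5.2), $W = (W(t_1),...,W(t_m))$, $W(t_j) = E[u(X_1)I_{(-\infty,t_j]}(X_1)]$,
and the notation $(-\infty,t]$ is the same as that in (5.1).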
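Proof. We prove the case of $m = 1$. The case of $m \ge 2$ is left as an
exercise. Let $\bar{u} = n^{-1}\sum_{i=1}^n u(X_i)$. It follows from (5.13), (5.14), and
Taylor's expansion that

$$\bar{u} = \frac{1}{n}\sum_{i=1}^n u(X_i)[u(X_i)]^\tau \lambda_n [1 + o_p(1)].$$

By the SLLN and CLT,

$$U^{-1}\bar{u} = \lambda_n + o_p(n^{-1/2}).$$

Using Taylor's expansion and the SLLN again, we have

$$\frac{1}{n}\sum_{i=1}^n I_{(-\infty,t]}(X_i)(n\hat{p}_i - 1) = \frac{1}{n}\sum_{i=1}^n I_{(-\infty,t]}(X_i)\left[\frac{1}{1 + \lambda_n^\tau u(X_i)} - 1\right]$$
$$= -\frac{1}{n}\sum_{i=1}^n I_{(-\infty,t]}(X_i)\lambda_n^\tau u(X_i) + o_p(n^{-1/2})$$
$$= -\lambda_n^\tau W(t) + o_p(n^{-1/2})$$
$$= -\bar{u}^\tau U^{-1} W(t) + o_p(n^{-1/2}).$$

Thus,

$$\hat{F}(t) - F(t) = F_n(t) - F(t) + \frac{1}{n}\sum_{i=1}^n I_{(-\infty,t]}(X_i)(n\hat{p}_i - 1)$$
$$= F_n(t) - F(t) - \bar{u}^\tau U^{-1} W(t) + o_p(n^{-1/2})$$
$$= \frac{1}{n}\sum_{i=1}^n \left\{I_{(-\infty,t]}(X_i) - F(t) - [u(X_i)]^\tau U^{-1} W(t)\right\} + o_p(n^{-1/2}).$$

The result follows from the CLT and the fact that

$$\mathrm{Var}\big([W(t)]^\tau U^{-1} u(X_i)\big) = [W(t)]^\tau U^{-1} U U^{-1} W(t)$$
$$= [W(t)]^\tau U^{-1} W(t)$$
$$= E\{[W(t)]^\tau U^{-1} u(X_i) I_{(-\infty,t]}(X_i)\}$$
$$= \mathrm{Cov}\big(I_{(-\infty,t]}(X_i), [W(t)]^\tau U^{-1} u(X_i)\big). \quad ■$$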



Comparing (5.15) with (5.2), we conclude that $\hat{F}$ is asymptotically more
efficient than $F_n$.
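The size of the efficiency gain can be made concrete in a simple special case. The sketch below works under illustrative assumptions that are not taken from the text: $d = s = m = 1$, $F$ is the $N(0,1)$ c.d.f., the auxiliary information is that the mean is 0 (so $u(x) = x$ and $U = 1$), and $t = 0$. Then $F(0) = \frac{1}{2}$ and $W(0) = E[X_1 I_{(-\infty,0]}(X_1)] = -1/\sqrt{2\pi}$, and the two asymptotic variances can be compared directly.

```python
import math

# Assumed setting: X_1 ~ N(0,1), u(x) = x, t = 0 (for illustration only)
F_t = 0.5                             # F(0) for the standard normal
W_t = -1.0 / math.sqrt(2.0 * math.pi)  # E[X_1 * I(X_1 <= 0)] = -phi(0)
U = 1.0                               # Var(u(X_1)) = Var(X_1)

sigma = F_t * (1.0 - F_t)             # asymptotic variance of F_n(0), from (5.2)
sigma_u = sigma - W_t * W_t / U       # asymptotic variance of the MELE, Sigma - W'U^{-1}W

print(sigma, sigma_u, sigma_u / sigma)
```

In this assumed setting the variance drops from $1/4$ to $1/4 - 1/(2\pi)$, so using the known mean removes well over half of the asymptotic variance of the estimator of $F(0)$.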


Example 5.1 (Survey problems). An example of situations in which we
have auxiliary information expressed as (5.9) is a survey problem (Example
2.3) where the population $\mathcal{P} = \{y_1,...,y_N\}$ consists of two-dimensional $y_j$'s,
$y_j = (y_{1j},y_{2j})$, and the population mean $\bar{Y}_2 = N^{-1}\sum_{j=1}^N y_{2j}$ is known.
For example, suppose that $y_{1j}$ is the current year's income of unit $j$ in
the population and $y_{2j}$ is the last year's income. In many applications
the population total or mean of $y_{2j}$'s is known, for example, from tax
return records. Let $X_1,...,X_n$ be a simple random sample (see Example
2.3) selected from $\mathcal{P}$ with replacement. Then $X_i$'s are i.i.d. bivariate random
vectors whose c.d.f. is

$$F(t) = \frac{1}{N}\sum_{j=1}^N I_{(-\infty,t]}(y_j), \tag{5.16}$$

where the notation $(-\infty,t]$ is the same as that in (5.1). If $\bar{Y}_2$ is known, then
it can be expressed as (5.9) with $u(x_1,x_2) = x_2 - \bar{Y}_2$. In survey problems
$X_i$'s are usually sampled without replacement so that $X_1,...,X_n$ are not
i.i.d. However, for a simple random sample without replacement, (5.8) can
still be treated as an empirical likelihood, given $X_i$'s. Note that $F$ in (5.16)
is the c.d.f. of $X_i$, regardless of whether $X_i$'s are sampled with replacement.

If $X = (X_1, ..., X_n)$ is not a simple random sample, then the likelihood
(5.8) has to be modified. Suppose that $\pi_i$ is the probability that the $i$th
unit is selected (see Theorem 3.15). Given $X = \{y_i, i \in s\}$, an empirical
likelihood is

$$\ell(G) = \prod_{i \in s} [P_G(\{y_i\})]^{1/\pi_i} = \prod_{i \in s} p_i^{1/\pi_i}, \tag{5.17}$$

where $p_i = P_G(\{y_i\})$. With the auxiliary information (5.9), an MELE of $F$
in (5.16) can be obtained by maximizing $\ell(G)$ in (5.17) subject to (5.10).
In this case $F$ may not be the c.d.f. of $X_i$, but the c.d.f.'s of $X_i$'s are
determined by $F$ and $\pi_i$'s. It can be shown (exercise) that an MELE is
given by (5.11) with

$$\hat p_i = \frac{1}{\pi_i [1 + \lambda_n^\tau u(y_i)]} \Big/ \sum_{j \in s} \frac{1}{\pi_j} \tag{5.18}$$

and

$$\sum_{i \in s} \frac{u(y_i)}{\pi_i [1 + \lambda_n^\tau u(y_i)]} = 0. \tag{5.19}$$

If $\pi_i$ = a constant, then the MELE reduces to that in (5.11)-(5.13). If
$u(x) \equiv 0$ (no auxiliary information), then the MELE is

$$\hat F(t) = \sum_{i \in s} \frac{1}{\pi_i} I_{(-\infty,t]}(y_i) \Big/ \sum_{i \in s} \frac{1}{\pi_i},$$

which is a ratio of two Horvitz-Thompson estimators (§3.4.2). Some asymptotic
properties of the MELE $\hat F$ can be found in Chen and Qin (1993). ∎
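
The ratio of two Horvitz-Thompson estimators in Example 5.1 can be sketched numerically as follows. This is a minimal illustration for univariate $y_i$'s, not a survey-sampling implementation; the function name `ht_ratio_cdf` and its argument names are hypothetical.

```python
def ht_ratio_cdf(values, incl_probs, t):
    """Ratio of two Horvitz-Thompson estimators of F(t).

    values     : sampled y_i, i in s (univariate, for simplicity)
    incl_probs : inclusion probabilities pi_i of the sampled units
    t          : point at which the c.d.f. is estimated
    """
    # Numerator: sum over sampled units of I(y_i <= t) / pi_i
    num = sum(1.0 / p for y, p in zip(values, incl_probs) if y <= t)
    # Denominator: sum over sampled units of 1 / pi_i
    den = sum(1.0 / p for p in incl_probs)
    return num / den
```

When all `incl_probs` are equal, the weights cancel and the function returns the value of the empirical c.d.f. at `t`.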


The second part of Example 5.1 shows how to use empirical likelihoods 
in a non-i.i.d. problem. Applications of empirical likelihoods in non-i.i.d. 
problems are usually straightforward extensions of those in i.i.d. cases. The 
following is another example. 


Example 5.2 (Biased sampling). Biased sampling is often used in applications.
Suppose that $n = n_1 + \cdots + n_k$, $k \ge 2$; $X_i$'s are independent random
variables; $X_1, ..., X_{n_1}$ are i.i.d. with $F$, and $X_{n_1+\cdots+n_j+1}, ..., X_{n_1+\cdots+n_{j+1}}$
are i.i.d. with the c.d.f.

$$\int_{-\infty}^t w_{j+1}(s) dF(s) \Big/ \int_{-\infty}^\infty w_{j+1}(s) dF(s),$$

$j = 1, ..., k-1$, where $w_j$'s are some nonnegative Borel functions. A simple
example is that $X_1, ..., X_{n_1}$ are sampled from $F$ and $X_{n_1+1}, ..., X_{n_1+n_2}$ are
sampled from $F$ but conditional on the fact that each sampled value exceeds
a given value $x_0$ (i.e., $w_2(s) = I_{(x_0,\infty)}(s)$). For instance, $X_i$'s are blood
pressure measurements; $X_1, ..., X_{n_1}$ are sampled from ordinary people and
$X_{n_1+1}, ..., X_{n_1+n_2}$ are sampled from patients whose blood pressures are
higher than $x_0$. The name biased sampling comes from the fact that there
is a bias in the selection of samples.


For simplicity we consider the case of $k = 2$, since the extension to $k \ge 3$
is straightforward. Denote $w_2$ by $w$. An empirical likelihood is

$$\begin{aligned}
\ell(G) &= \prod_{i=1}^{n_1} P_G(\{x_i\}) \prod_{i=n_1+1}^{n} \frac{w(x_i) P_G(\{x_i\})}{\int w(s) dG(s)} \\
&= \left[ \sum_{i=1}^n p_i w(x_i) \right]^{-n_2} \prod_{i=1}^n p_i \prod_{i=n_1+1}^n w(x_i),
\end{aligned} \tag{5.20}$$

where $p_i = P_G(\{x_i\})$. An MELE of $F$ can be obtained by maximizing the
empirical likelihood (5.20) subject to $p_i \ge 0$, $i = 1, ..., n$, and $\sum_{i=1}^n p_i = 1$.
Using the Lagrange multiplier method we can show (exercise) that an
MELE $\hat F$ is given by (5.11) with

$$\hat p_i = [n_1 + n_2 w(X_i)/\hat w]^{-1}, \quad i = 1, ..., n, \tag{5.21}$$

where $\hat w$ satisfies

$$\hat w = \sum_{i=1}^n \frac{w(X_i)}{n_1 + n_2 w(X_i)/\hat w}.$$

An asymptotic result similar to that in Theorem 5.4 can be established
(Vardi, 1985; Qin, 1993).

If the function $w$ depends on an unknown parameter vector $\theta$, then the
method of profile empirical likelihood (see §5.1.4) can be applied.
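
A numerical sketch of (5.21) for $k = 2$ follows. It assumes a known weight function $w$ and solves the equation for $\hat w$ by bisection, which is our choice of solver and not part of the text; the names `biased_sampling_mele`, `n1`, and `iters` are hypothetical. Dividing the defining equation by $\hat w > 0$ gives $\sum_i w(X_i)/[n_1 \hat w + n_2 w(X_i)] = 1$, whose left-hand side is decreasing in $\hat w$.

```python
def biased_sampling_mele(x, n1, w, iters=200):
    """Return (w_hat, p_hat) for the two-sample biased sampling model.

    x  : pooled sample; the first n1 values are drawn from F, the
         remaining n2 = len(x) - n1 from the w-weighted version of F
    w  : known nonnegative weight function
    """
    n = len(x)
    n2 = n - n1
    wx = [w(v) for v in x]

    # h(c) = sum_i w_i / (n1*c + n2*w_i) - 1 is decreasing in c > 0,
    # and h(c) = 0 is equivalent to the equation defining w_hat.
    def h(c):
        return sum(wi / (n1 * c + n2 * wi) for wi in wx if wi > 0) - 1.0

    lo, hi = 1e-12 * max(wx), max(wx)  # h(hi) <= 0 always holds
    if h(lo) <= 0:
        raise ValueError("no positive solution found for w_hat")
    for _ in range(iters):  # bisection keeps h(lo) > 0 >= h(hi)
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    w_hat = 0.5 * (lo + hi)
    p_hat = [1.0 / (n1 + n2 * wi / w_hat) for wi in wx]
    return w_hat, p_hat
```

At the solution, the masses `p_hat` sum to 1 and `w_hat` equals $\sum_i \hat p_i w(X_i)$, which gives a quick sanity check on the output.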


Our last example concerns an important application in survival analysis. 


Example 5.3 (Censored data). Let $T_1, ..., T_n$ be survival times that are
i.i.d. nonnegative random variables from a c.d.f. $F$, and $C_1, ..., C_n$ be i.i.d.
nonnegative random variables independent of $T_i$'s. In a variety of applications
in biostatistics and life-time testing, we are only able to observe the
smaller of $T_i$ and $C_i$ and an indicator of which variable is smaller:

$$X_i = \min\{T_i, C_i\}, \quad \delta_i = I_{(0,C_i)}(T_i), \quad i = 1, ..., n.$$

This is called a random censorship model and $C_i$'s are called censoring
times. We consider the estimation of the survival distribution $F$; see
Kalbfleisch and Prentice (1980) for other problems involving censored data.


An MELE of $F$ can be derived as follows. Let $x_{(1)} \le \cdots \le x_{(n)}$ be
ordered values of $X_i$'s and $\delta_{(i)}$ be the $\delta$-value associated with $x_{(i)}$. Consider
a c.d.f. $G$ that assigns its mass to the points $x_{(1)}, ..., x_{(n)}$ and the interval
$(x_{(n)}, \infty)$. Let $p_i = P_G(\{x_{(i)}\})$, $i = 1, ..., n$, and $p_{n+1} = 1 - G(x_{(n)})$. An
MELE of $F$ is then obtained by maximizing

$$\ell(G) = \prod_{i=1}^n p_i^{\delta_{(i)}} \left( \sum_{j=i+1}^{n+1} p_j \right)^{1-\delta_{(i)}} \tag{5.22}$$

subject to

$$p_i \ge 0, \quad i = 1, ..., n+1, \qquad \sum_{i=1}^{n+1} p_i = 1. \tag{5.23}$$
i=1 
It can be shown (exercise) that an MELE is

$$\hat F(t) = \sum_{i=1}^{n+1} \hat p_i I_{(X_{(i-1)},t]}(X_{(i)}), \tag{5.24}$$

where $X_{(0)} = 0$, $X_{(n+1)} = \infty$, $X_{(1)} \le \cdots \le X_{(n)}$ are order statistics, and

$$\hat p_i = \frac{\delta_{(i)}}{n-i+1} \prod_{j=1}^{i-1} \left( 1 - \frac{\delta_{(j)}}{n-j+1} \right), \quad i = 1, ..., n, \qquad \hat p_{n+1} = 1 - \sum_{j=1}^n \hat p_j.$$


The $\hat F$ in (5.24) can also be written as (exercise)

$$\hat F(t) = 1 - \prod_{X_{(i)} \le t} \left( 1 - \frac{\delta_{(i)}}{n-i+1} \right), \tag{5.25}$$

which is the well-known Kaplan-Meier (1958) product-limit estimator. Some
asymptotic results for $\hat F$ in (5.25) can be found, for example, in Shorack
and Wellner (1986). ∎
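
The product-limit form (5.25) translates directly into a short computation. The following is a minimal sketch that assumes there are no ties among the observed $X_i$'s; the function name `kaplan_meier_cdf` is hypothetical.

```python
def kaplan_meier_cdf(x, delta, t):
    """Product-limit estimate of F(t) as in (5.25), assuming no ties.

    x     : observed values X_i = min(T_i, C_i)
    delta : indicators, 1 if X_i is an uncensored survival time, else 0
    t     : point at which F is estimated
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices of X_(1), ..., X_(n)
    surv = 1.0
    for rank, i in enumerate(order, start=1):
        if x[i] > t:
            break
        # Factor 1 - delta_(i) / (n - i + 1) for the i-th order statistic
        surv *= 1.0 - delta[i] / (n - rank + 1)
    return 1.0 - surv
```

With all `delta` equal to 1 (no censoring), the product telescopes and the estimate coincides with the empirical c.d.f.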


5.1.3 Density estimation 


Suppose that $X_1, ..., X_n$ are i.i.d. random variables from $F$ and that $F$ is
unknown but has a Lebesgue p.d.f. $f$. Estimation of $F$ can be done by
estimating $f$, which is called density estimation. Note that estimators of $F$
derived in §5.1.1 and §5.1.2 do not have Lebesgue p.d.f.'s.

Since $f(t) = F'(t)$ a.e., a simple estimator of $f(t)$ is the difference
quotient

$$f_n(t) = \frac{F_n(t + \lambda_n) - F_n(t - \lambda_n)}{2\lambda_n}, \quad t \in \mathcal{R}, \tag{5.26}$$

where $F_n$ is the empirical c.d.f. given by (2.28) or (5.1) with $d = 1$, and
$\{\lambda_n\}$ is a sequence of positive constants. Since $2n\lambda_n f_n(t)$ has the binomial
distribution $Bi(F(t + \lambda_n) - F(t - \lambda_n), n)$,

$$E[f_n(t)] \to f(t) \quad \text{if } \lambda_n \to 0 \text{ as } n \to \infty$$

and

$$\mathrm{Var}(f_n(t)) \to 0 \quad \text{if } \lambda_n \to 0 \text{ and } n\lambda_n \to \infty.$$

Thus, we should choose $\lambda_n$ converging to 0 slower than $n^{-1}$. If we assume
that $\lambda_n \to 0$, $n\lambda_n \to \infty$, and $f$ is continuously differentiable at $t$, then it
can be shown (exercise) that

$$\mathrm{mse}_{f_n(t)}(F) = \frac{f(t)}{2n\lambda_n} + o\left( \frac{1}{n\lambda_n} \right) + O(\lambda_n^2) \tag{5.27}$$

and, under the additional condition that $n\lambda_n^3 \to 0$,

$$\sqrt{n\lambda_n}\,[f_n(t) - f(t)] \to_d N(0, \tfrac{1}{2} f(t)). \tag{5.28}$$


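A useful class of estimators is the class of kernel density estimators of
the form

$$\hat f(t) = \frac{1}{n\lambda_n} \sum_{i=1}^n w\left( \frac{t - X_i}{\lambda_n} \right), \tag{5.29}$$

where $w$ is a known Lebesgue p.d.f. on $\mathcal{R}$ and is called the kernel. If we
choose $w(t) = \frac{1}{2} I_{[-1,1]}(t)$, then $\hat f(t)$ in (5.29) is essentially the same as the
so-called histogram. The bias of $\hat f(t)$ in (5.29) is

$$\begin{aligned}
E[\hat f(t)] - f(t) &= \frac{1}{\lambda_n} \int w\left( \frac{t - z}{\lambda_n} \right) f(z) dz - f(t) \\
&= \int w(y)[f(t - \lambda_n y) - f(t)] dy.
\end{aligned}$$

If $f$ is bounded and continuous at $t$, then, by the dominated convergence
theorem (Theorem 1.1(iii)), the bias of $\hat f(t)$ converges to 0 as $\lambda_n \to 0$; if $f'$
is bounded and continuous at $t$ and $\int |t| w(t) dt < \infty$, then the bias of $\hat f(t)$
is $O(\lambda_n)$. The variance of $\hat f(t)$ is

$$\begin{aligned}
\mathrm{Var}(\hat f(t)) &= \frac{1}{n\lambda_n^2} \mathrm{Var}\left( w\left( \frac{t - X_1}{\lambda_n} \right) \right) \\
&= \frac{1}{n\lambda_n} \int [w(y)]^2 f(t - \lambda_n y) dy + O\left( \frac{1}{n} \right) \\
&= \frac{w_0 f(t)}{n\lambda_n} + o\left( \frac{1}{n\lambda_n} \right)
\end{aligned}$$

if $f$ is bounded and continuous at $t$ and $w_0 = \int [w(t)]^2 dt < \infty$. Hence, if
$\lambda_n \to 0$, $n\lambda_n \to \infty$, and $f'$ is bounded and continuous at $t$, then

$$\mathrm{mse}_{\hat f(t)}(F) = \frac{w_0 f(t)}{n\lambda_n} + o\left( \frac{1}{n\lambda_n} \right) + O(\lambda_n^2).$$

Using the CLT (Theorem 1.15), one can show (exercise) that if $\lambda_n \to 0$,
$n\lambda_n \to \infty$, and $f$ is bounded and continuous at $t$, then

$$\sqrt{n\lambda_n}\,\{\hat f(t) - E[\hat f(t)]\} \to_d N(0, w_0 f(t)). \tag{5.30}$$

Furthermore, if $f'$ is bounded and continuous at $t$, $\int |t| w(t) dt < \infty$, and
$n\lambda_n^3 \to 0$, then

$$\sqrt{n\lambda_n}\,\{E[\hat f(t)] - f(t)\} = O\left( \sqrt{n\lambda_n}\,\lambda_n \right) \to 0$$

and, therefore, (5.30) holds with $E[\hat f(t)]$ replaced by $f(t)$.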


Similar to the estimation of a c.d.f., we can also study global properties
of $f_n$ or $\hat f$ as an estimator of the density curve $f$, using a suitably defined
distance between $f$ and its density estimator. For example, we may study
the convergence of $\sup_{t \in \mathcal{R}} |\hat f(t) - f(t)|$ or $\int |\hat f(t) - f(t)|^2 dt$. More details
can be found, for example, in Silverman (1986).

[Figure 5.1: Density estimates in Example 5.4. Curves shown: true p.d.f., estimator (5.26), estimator (5.29); horizontal axis: t.]


Example 5.4. An i.i.d. sample of size $n = 200$ was generated from $N(0, 1)$.
Density curve estimates (5.26) and (5.29) are plotted in Figure 5.1 with the
curve of the true p.d.f. For the kernel density estimator (5.29), $w(t) = \frac{1}{2} e^{-|t|}$
is used and $\lambda_n = 0.4$. From Figure 5.1, it seems that the kernel estimate
(5.29) is much better than the estimate (5.26). ∎
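
The two estimators compared in this section, the difference quotient (5.26) and the kernel estimator (5.29), can be sketched as follows. The function names are hypothetical, and the double exponential and uniform kernels are supplied only as illustrative choices of $w$.

```python
import math


def diff_quotient_density(sample, t, lam):
    """Difference quotient (5.26): [F_n(t + lam) - F_n(t - lam)] / (2 * lam)."""
    n = len(sample)
    # n * [F_n(t + lam) - F_n(t - lam)] counts the X_i in (t - lam, t + lam]
    count = sum(1 for x in sample if t - lam < x <= t + lam)
    return count / (2.0 * n * lam)


def kernel_density(sample, t, lam, kernel):
    """Kernel estimator (5.29) with bandwidth lam and kernel w = kernel."""
    n = len(sample)
    return sum(kernel((t - x) / lam) for x in sample) / (n * lam)


def double_exponential_kernel(u):
    """w(u) = exp(-|u|) / 2, a Lebesgue p.d.f. on the real line."""
    return 0.5 * math.exp(-abs(u))


def uniform_kernel(u):
    """w(u) = 1/2 on [-1, 1]; (5.29) then matches (5.26) up to boundary points."""
    return 0.5 if -1.0 <= u <= 1.0 else 0.0
```

Evaluating both functions on a grid of `t` values with the same `lam` gives curves of the kind compared in Example 5.4; the uniform kernel makes the relation between (5.26) and (5.29) easy to check numerically.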


There are many other density estimation methods, for example, the
nearest neighbor method (Stone, 1977), the smoothing splines (Wahba,
1990), and the method of empirical likelihoods described in §5.1.2 (see,
e.g., Jones (1991)), which produces estimators of the form

$$\hat f(t) = \frac{1}{\lambda_n} \sum_{i=1}^n \hat p_i w\left( \frac{t - X_i}{\lambda_n} \right).$$


5.1.4 Semi-parametric methods 


Suppose that the sample $X$ is from a population in a family indexed by
$(\theta, \xi)$, where $\theta$ is a parameter vector, i.e., $\theta \in \Theta \subset \mathcal{R}^k$ with a fixed positive
integer $k$, but $\xi$ is not vector-valued, e.g., $\xi$ is a c.d.f. Such a model is often
called a semi-parametric model, although it is nonparametric according to
our definition in §2.1.2. A semi-parametric method refers to a statistical
inference method that combines a parametric method and a nonparametric
method in making an inference about the parametric component $\theta$ and the
nonparametric component $\xi$. In the following, we consider two important
examples of semi-parametric methods.


Partial likelihoods and proportional hazards models 


The idea of partial likelihood (Cox, 1972) is similar to that of conditional
likelihood introduced in §4.4.3. To illustrate this idea, we assume that $X$
has a p.d.f. $f_{\theta,\xi}$ and $\xi$ is also a vector-valued parameter. Suppose that $X$
can be transformed into a sequence of pairs $(V_1, U_1), ..., (V_m, U_m)$ such that

$$f_{\theta,\xi}(x) = \left[ \prod_{i=1}^m g_\theta(u_i | v_1, u_1, ..., u_{i-1}, v_i) \right] \left[ \prod_{i=1}^m h_{\theta,\xi}(v_i | v_1, u_1, ..., v_{i-1}, u_{i-1}) \right],$$

where $g_\theta(\cdot | v_1, u_1, ..., u_{i-1}, v_i)$ is the conditional p.d.f. of $U_i$ given $V_1 = v_1$,
$U_1 = u_1$, ..., $U_{i-1} = u_{i-1}$, $V_i = v_i$, which does not depend on $\xi$, and
$h_{\theta,\xi}(\cdot | v_1, u_1, ..., v_{i-1}, u_{i-1})$ is the conditional p.d.f. of $V_i$ given $V_1 = v_1$, $U_1 = u_1$,
..., $V_{i-1} = v_{i-1}$, $U_{i-1} = u_{i-1}$. The first product in the previous expression
for $f_{\theta,\xi}(x)$ is called the partial likelihood for $\theta$.


When $\xi$ is a nonparametric component, the partial likelihood for $\theta$ can
be similarly defined, in which case the full likelihood $f_{\theta,\xi}(x)$ should be replaced
by a nonparametric likelihood or an empirical likelihood. As long as
the conditional distributions of $U_i$ given $V_1, U_1, ..., U_{i-1}, V_i$, $i = 1, ..., m$, are
in a parametric family (indexed by $\theta$), the partial likelihood is parametric.

A semi-parametric estimation method consists of a parametric method
(typically the maximum likelihood method in §4.4) for estimating $\theta$ and a
nonparametric method for estimating $\xi$.


To illustrate the application of the method of partial likelihoods, we
consider the estimation of the c.d.f. of survival data in the random censorship
model described in Example 5.3. Following the notation in Example
5.3, we assume that $\{T_1, ..., T_n\}$ (survival times) and $\{C_1, ..., C_n\}$ (censoring
times) are two sets of independent nonnegative random variables and
that $X_i = \min\{T_i, C_i\}$ and $\delta_i = I_{(0,C_i)}(T_i)$, $i = 1, ..., n$, are independent
observations. In addition, we assume that there is a $p$-vector $Z_i$ of covariate
values associated with $X_i$ and $\delta_i$. The situation considered in Example 5.3
can be viewed as a special homogeneous case with $Z_i \equiv$ a constant.


The survival function when the covariate vector is equal to $z$ is defined
to be $S_z(t) = 1 - F_z(t)$, where $F_z$ is the c.d.f. of the survival time $T$ having
the same distribution as $T_i$. Assume that $f_z(t) = F_z'(t)$ exists for all $t > 0$.
The function $\lambda_z(t) = f_z(t)/S_z(t)$ is called the hazard function and the
function $\Lambda_z(t) = \int_0^t \lambda_z(s) ds$ is called the cumulative hazard function, when
the covariate vector is equal to $z$. A commonly adopted model for $\lambda_z$ is the
following proportional hazards model:

$$\lambda_z(t) = \lambda_0(t) \phi(\beta^\tau z), \tag{5.31}$$

where $\phi$ is a known function (typically $\phi(x) = e^x$), $z$ is a value of the
$p$-vector of covariates, $\beta \in \mathcal{R}^p$ is an unknown parameter vector, and $\lambda_0(t)$ is
the unknown hazard function when the covariate vector is 0 and is referred
to as the baseline hazard function. Under model (5.31),

$$1 - F_z(t) = \exp\{-\Lambda_z(t)\} = \exp\{-\phi(\beta^\tau z) \Lambda_0(t)\}.$$

Thus, the estimation of the c.d.f. $F_z$ or the survival function $S_z$ can be done
through the estimation of $\beta$, the parametric component of model (5.31), and
$\Lambda_0$, the nonparametric component of model (5.31).


Consider first the estimation of $\beta$ using the method of partial likelihoods.
Suppose that there are $l$ observed failures at times $T_{(1)} < \cdots < T_{(l)}$, where
$(i)$ is the label for the $i$th failure ordered according to the time to failure.
(Note that a failure occurs when $\delta_i = 1$.) Suppose that there are $m_i$ items
censored at or after $T_{(i)}$ but before $T_{(i+1)}$ at times $T_{(i,1)}, ..., T_{(i,m_i)}$ (setting
$T_{(0)} = 0$). Let $U_i = (i)$ and $V_i = (T_{(i)}, T_{(i-1,1)}, ..., T_{(i-1,m_{i-1})})$, $i = 1, ..., l$.
Then the partial likelihood is

$$\prod_{i=1}^l P(U_i = (i) | V_1, U_1, ..., U_{i-1}, V_i).$$


Since $\lambda_z(t) = \lim_{\Delta > 0, \Delta \to 0} \Delta^{-1} P_z(t \le T < t + \Delta | T \ge t)$, where $P_z$ denotes
the probability measure of $T$ when the covariate is equal to $z$,

$$P(U_i = (i) | V_1, U_1, ..., U_{i-1}, V_i) = \frac{\lambda_{Z_{(i)}}(t_i)}{\sum_{j \in R_i} \lambda_{Z_j}(t_i)} = \frac{\phi(\beta^\tau Z_{(i)})}{\sum_{j \in R_i} \phi(\beta^\tau Z_j)},$$

where $t_i$ is the observed value of $T_{(i)}$, $R_i = \{j : X_j \ge t_i\}$ is called the risk
set, and the last equality follows from assumption (5.31). This leads to the
partial likelihood

$$\ell(\beta) = \prod_{i=1}^l \frac{\phi(\beta^\tau Z_{(i)})}{\sum_{j \in R_i} \phi(\beta^\tau Z_j)} = \prod_{i=1}^n \left[ \frac{\phi(\beta^\tau Z_i)}{\sum_{j \in R_i} \phi(\beta^\tau Z_j)} \right]^{\delta_i},$$

which is a function of the parameter $\beta$, given the observed data. The
maximum likelihood method introduced for parametric models in §4.4 can
be applied to obtain a maximum partial likelihood estimator $\hat\beta$ of $\beta$. It
is shown in Tsiatis (1981) that $\hat\beta$ is consistent for $\beta$ and is asymptotically
normal under some regularity conditions.
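
As a sketch of how the partial likelihood can be evaluated, the following computes its logarithm for the typical choice $\phi(x) = e^x$. It assumes no tied failure times and takes the risk set of unit $i$ to be $\{j : X_j \ge X_i\}$; the function name `cox_log_partial_likelihood` is hypothetical.

```python
import math


def cox_log_partial_likelihood(beta, x, delta, z):
    """Log partial likelihood under (5.31) with phi(u) = exp(u), no ties.

    beta  : list of p coefficients
    x     : observed values X_i = min(T_i, C_i)
    delta : indicators, 1 for an observed failure and 0 for censoring
    z     : list of covariate vectors Z_i (each a list of length p)
    """
    def lin(zi):  # linear predictor beta^T z_i
        return sum(b * v for b, v in zip(beta, zi))

    total = 0.0
    for i in range(len(x)):
        if delta[i] == 1:
            # Sum of phi(beta^T Z_j) over the risk set {j : X_j >= X_i}
            risk = sum(math.exp(lin(z[j])) for j in range(len(x)) if x[j] >= x[i])
            total += lin(z[i]) - math.log(risk)
    return total
```

A maximum partial likelihood estimate could then be obtained by passing the negative of this function to any numerical optimizer; at `beta = 0` the value reduces to minus the sum of the logarithms of the risk-set sizes at the failure times.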


We now consider the estimation of $\Lambda_0$. First, assume that the covariate
vector $Z_i$ is random, $(T_i, C_i, Z_i)$ are i.i.d., and $T_i$ and $C_i$ are conditionally
independent given $Z_i$. Let $(T, C, Z)$ be the random vector having the same
distribution as $(T_i, C_i, Z_i)$, $X = \min\{T, C\}$, and $\delta = I_{(0,C)}(T)$. Under
assumption (5.31), it can be shown (exercise) that

$$Q(t) = P(X > t, \delta = 1) = \int \int_t^\infty \lambda_0(s) \phi(\beta^\tau z) H(s|z) ds\, dG(z), \tag{5.32}$$

where $H(s|z) = P(X > s | Z = z)$ and $G$ is the c.d.f. of $Z$. Then

$$\frac{dQ(t)}{dt} = -\lambda_0(t) \int \phi(\beta^\tau z) H(t|z) dG(z) \tag{5.33}$$

and

$$\lambda_0(t) = -\frac{dQ(t)}{dt} \frac{1}{K(t)}, \tag{5.34}$$

where $K(t) = E[\phi(\beta^\tau Z) I_{(t,\infty)}(X)]$ (exercise). Consequently,

$$\Lambda_0(t) = -\int_0^t \frac{dQ(s)}{K(s)}.$$

An estimator of $\Lambda_0$ can then be obtained by replacing $Q$ and $K$ in the
previous expression by their estimators

$$\hat Q(t) = \frac{1}{n} \sum_{i=1}^n I_{\{X_i > t, \delta_i = 1\}}$$

and

$$\hat K(t) = \frac{1}{n} \sum_{i=1}^n \phi(\hat\beta^\tau Z_i) I_{(t,\infty)}(X_i). \tag{5.35}$$

This estimator is known as Breslow's estimator. When $Z_1, ..., Z_n$ are nonrandom,
we can still use Breslow's estimator. Its asymptotic properties can
be found, for example, in Fleming and Harrington (1991).
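
Since $\hat Q$ is a step function that drops by $1/n$ at each observed failure, the integral defining the estimator of $\Lambda_0$ becomes a finite sum over failure times. A minimal sketch with $\phi(x) = e^x$ follows. One assumption is made loudly here: the risk-set sum at a failure time $X_i$ includes unit $i$ itself (units with $X_j \ge X_i$), which is the usual convention for this estimator and avoids a zero denominator at the largest failure time. The function name is hypothetical.

```python
import math


def breslow_cumulative_hazard(beta_hat, x, delta, z, t):
    """Breslow-type estimate of the baseline cumulative hazard at time t.

    beta_hat : estimated coefficient vector (list of length p)
    x, delta : observed times and failure indicators
    z        : list of covariate vectors
    """
    def phi_lin(zi):  # phi(beta^T z_i) with phi = exp
        return math.exp(sum(b * v for b, v in zip(beta_hat, zi)))

    total = 0.0
    for i in range(len(x)):
        if delta[i] == 1 and x[i] <= t:
            # Assumption: the risk set at X_i is {j : X_j >= X_i}
            denom = sum(phi_lin(z[j]) for j in range(len(x)) if x[j] >= x[i])
            total += 1.0 / denom
    return total
```

With `beta_hat` equal to the zero vector, every unit has weight 1 and the sum reduces to the reciprocals of the risk-set sizes at the observed failure times.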


Profile likelihoods 


Let $\ell(\theta, \xi)$ be a likelihood (or empirical likelihood), where $\theta$ and $\xi$ are not
necessarily vector-valued. It may be difficult to maximize the likelihood
$\ell(\theta, \xi)$ simultaneously over $\theta$ and $\xi$. For each fixed $\theta$, let $\xi(\theta)$ satisfy

$$\ell(\theta, \xi(\theta)) = \sup_\xi \ell(\theta, \xi).$$

The function

$$\ell_P(\theta) = \ell(\theta, \xi(\theta))$$

is called a profile likelihood function for $\theta$. Suppose that $\hat\theta_P$ maximizes
$\ell_P(\theta)$. Then $\hat\theta_P$ is called a maximum profile likelihood estimator of $\theta$. Note
that $\hat\theta_P$ may be different from an MLE of $\theta$. Although this idea can be
applied to parametric models, it is more useful in nonparametric models,
especially when $\theta$ is a parametric component.

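For example, consider the empirical likelihood in (5.8) subject to the
constraints in (5.10). Sometimes it is more convenient to allow the function
$u$ in (5.10) to depend on an unknown parameter vector $\theta \in \mathcal{R}^k$, where $k \le s$.
This leads to the empirical likelihood $\ell(G)$ in (5.8) subject to (5.10) with
$u(x)$ replaced by $\psi(x, \theta)$, where $\psi$ is a known function from $\mathcal{R}^d \times \mathcal{R}^k$ to $\mathcal{R}^s$.
Maximizing this empirical likelihood is equivalent to maximizing

$$\ell(p_1, ..., p_n, \omega, \lambda, \theta) = \prod_{i=1}^n p_i + \omega \left( 1 - \sum_{i=1}^n p_i \right) + \sum_{i=1}^n p_i \lambda^\tau \psi(x_i, \theta),$$

where $\omega$ and $\lambda$ are Lagrange multipliers. It follows from (5.12) and (5.13)
that $\hat\omega = n$, $\hat p_i(\theta) = n^{-1}\{1 + [\lambda_n(\theta)]^\tau \psi(x_i, \theta)\}^{-1}$ with a $\lambda_n(\theta)$ satisfying

$$\frac{1}{n} \sum_{i=1}^n \frac{\psi(x_i, \theta)}{1 + [\lambda_n(\theta)]^\tau \psi(x_i, \theta)} = 0$$

maximize $\ell(p_1, ..., p_n, \omega, \lambda, \theta)$ for any fixed $\theta$. Substituting $\hat p_i$ with $\sum_{i=1}^n \hat p_i = 1$
into $\ell(p_1, ..., p_n, \omega, \lambda, \theta)$ leads to the following profile empirical likelihood
for $\theta$:

$$\ell_P(\theta) = \prod_{i=1}^n \frac{1}{n\{1 + [\lambda_n(\theta)]^\tau \psi(x_i, \theta)\}}. \tag{5.36}$$

If $\hat\theta$ is a maximum of $\ell_P(\theta)$ in (5.36), then $\hat\theta$ is a maximum profile empirical
likelihood estimator of $\theta$ and the corresponding estimator of $p_i$ is $\hat p_i(\hat\theta)$. A
result similar to Theorem 5.4 and a result on asymptotic normality of $\hat\theta$ are
established in Qin and Lawless (1994), under some conditions on $\psi$.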
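
To make the profile empirical likelihood concrete, consider the simplest scalar case $\psi(x, \theta) = x - \theta$ with $d = k = s = 1$, which is an illustrative assumption and not an example from the text. For fixed $\theta$ strictly between the smallest and largest observation, the equation for $\lambda_n(\theta)$ has a unique root on the interval where all $1 + \lambda(x_i - \theta)$ are positive, and bisection (our choice of solver) finds it. All function names are hypothetical.

```python
import math


def el_lambda(x, theta, iters=100):
    """Solve sum_i d_i / (1 + lam * d_i) = 0 with d_i = x_i - theta."""
    d = [xi - theta for xi in x]
    if not (min(d) < 0.0 < max(d)):
        raise ValueError("theta must lie strictly inside the range of the data")
    # All 1 + lam * d_i > 0 exactly when lam is in (-1/max(d), -1/min(d));
    # the left-hand side of the equation is decreasing in lam on this interval.
    lo, hi = -1.0 / max(d), -1.0 / min(d)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g = sum(di / (1.0 + mid * di) for di in d)
        if g > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def profile_el_weights(x, theta):
    """Masses p_i(theta) = 1 / (n * (1 + lam_n(theta) * (x_i - theta)))."""
    lam = el_lambda(x, theta)
    n = len(x)
    return [1.0 / (n * (1.0 + lam * (xi - theta))) for xi in x]


def log_profile_el(x, theta):
    """Logarithm of the profile empirical likelihood (5.36) in this special case."""
    return sum(math.log(p) for p in profile_el_weights(x, theta))
```

In this special case the profile is maximized at the sample mean, where $\lambda_n(\theta) = 0$ and every mass equals $1/n$; evaluating `log_profile_el` on a grid of `theta` values shows how quickly the profile falls away from that point.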


Another example is the empirical likelihood (5.20) in the problem of
biased sampling with a function $w(x) = w_\theta(x)$ depending on an unknown
$\theta \in \mathcal{R}^k$. The profile empirical likelihood for $\theta$ is then

$$\ell_P(\theta) = \hat w_\theta^{-n_2} \prod_{i=1}^n [n_1 + n_2 w_\theta(x_i)/\hat w_\theta]^{-1} \prod_{i=n_1+1}^n w_\theta(x_i),$$

where $\hat w_\theta$ satisfies

$$\hat w_\theta = \sum_{i=1}^n \frac{w_\theta(x_i)}{n_1 + n_2 w_\theta(x_i)/\hat w_\theta}.$$


Finally, we consider the problem of missing data. Assume that $X_1, ..., X_n$
are i.i.d. random variables from an unknown c.d.f. $F$ and some $X_i$'s are
missing. Let $\delta_i = 1$ if $X_i$ is observed and $\delta_i = 0$ if $X_i$ is missing. Suppose
that $(X_i, \delta_i)$ are i.i.d. Let

$$\pi(x) = P(\delta_i = 1 | X_i = x).$$

If $X_i$ and $\delta_i$ are independent, i.e., $\pi(x) \equiv \pi$ does not depend on $x$, then the
empirical c.d.f. based on observed data, i.e., the c.d.f. putting mass $r^{-1}$ to
each observed $X_i$, where $r$ is the number of observed $X_i$'s, is an unbiased
and consistent estimator of $F$, provided that $\pi > 0$. On the other hand,
if $\pi(x)$ depends on $x$, then the empirical c.d.f. based on observed data is a
biased and inconsistent estimator of $F$. In fact, it can be shown (exercise)
that the empirical c.d.f. based on observed data is an unbiased estimator
of $P(X_i \le x | \delta_i = 1)$, which is generally different from the unconditional
probability $F(x) = P(X_i \le x)$.

If both $\pi$ and $F$ are in parametric models, then we can apply the method
of maximum likelihood. For example, if $\pi(x) = \pi_\theta(x)$ and $F(x) = F_\vartheta(x)$
has a p.d.f. $f_\vartheta$, where $\theta$ and $\vartheta$ are vectors of unknown parameters, then a
parametric likelihood of $(\theta, \vartheta)$ is

$$\ell(\theta, \vartheta) = \prod_{i=1}^n [\pi_\theta(x_i) f_\vartheta(x_i)]^{\delta_i} (1 - \pi)^{1-\delta_i},$$

where $\pi = \int \pi_\theta(x) dF(x)$. Suppose now that $\pi(x) = \pi_\theta(x)$ is the parametric
component and $F$ is the nonparametric component. Then an empirical
likelihood can be defined as

$$\ell(\theta, G) = \prod_{i=1}^n [\pi_\theta(x_i) P_G(\{x_i\})]^{\delta_i} (1 - \pi)^{1-\delta_i}$$

subject to $p_i \ge 0$, $\sum_{i=1}^n \delta_i p_i = 1$, $\sum_{i=1}^n \delta_i p_i [\pi_\theta(x_i) - \pi] = 0$, where $p_i =
P_G(\{x_i\})$, $i = 1, ..., n$.
It can be shown (exercise) that the logarithm of the profile empirical
likelihood for $(\theta, \pi)$ (with a Lagrange multiplier) is

$$\sum_{i=1}^n \left\{ \delta_i \log(\pi_\theta(x_i)) + (1 - \delta_i) \log(1 - \pi) - \delta_i \log(1 + \lambda[\pi_\theta(x_i) - \pi]) \right\}. \tag{5.37}$$

Under some regularity conditions, Qin, Leung, and Shao (2002) show that
the estimators $\hat\theta$, $\hat\pi$, and $\hat\lambda$ obtained by maximizing the likelihood in (5.37)
are consistent and asymptotically normal and that the empirical c.d.f.
putting mass $\hat p_i = r^{-1}\{1 + \hat\lambda[\pi_{\hat\theta}(X_i) - \hat\pi]\}^{-1}$ to each observed $X_i$ is consistent
for $F$. The results are also extended to the case where a covariate
vector $Z_i$ associated with $X_i$ is observed for all $i$.


5.2 Statistical Functionals 


In many nonparametric problems, we are interested in estimating some
characteristics (parameters) of the unknown population, not the entire population.
We assume in this section that $X_i$'s are i.i.d. from an unknown
c.d.f. $F$ on $\mathcal{R}^d$. Most characteristics of $F$ can be written as $T(F)$, where $T$
is a functional from $\mathcal{F}$ to $\mathcal{R}^s$. If we estimate $F$ by the empirical c.d.f. $F_n$ in
(5.1), then a natural estimator of $T(F)$ is $T(F_n)$, which is called a statistical
functional.

Many commonly used statistics can be written as $T(F_n)$ for some $T$.
Two simple examples are given as follows. Let $T(F) = \int \psi(x) dF(x)$ with
an integrable function $\psi$, and $T(F_n) = \int \psi(x) dF_n(x) = n^{-1} \sum_{i=1}^n \psi(X_i)$.
The sample moments discussed in §3.5.2 are particular examples of this kind
of statistical functional. For $d = 1$, let $T(F) = F^{-1}(p) = \inf\{x : F(x) \ge p\}$,
where $p \in (0, 1)$ is a fixed constant. $F^{-1}(p)$ is called the $p$th quantile of $F$.
The statistical functional $T(F_n) = F_n^{-1}(p)$ is called the $p$th sample quantile.
More examples of statistical functionals are provided in §5.2.1 and §5.2.2.
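
The two plug-in examples above can be written out directly. In this sketch the function names are hypothetical; the quantile follows the definition $F_n^{-1}(p) = \inf\{x : F_n(x) \ge p\}$ literally, using $F_n(X_{(k)}) \ge k/n$.

```python
def plug_in_mean(sample, psi):
    """T(F_n) for T(F) = integral of psi dF, i.e. the average of psi(X_i)."""
    return sum(psi(x) for x in sample) / len(sample)


def sample_quantile(sample, p):
    """T(F_n) = inf{x : F_n(x) >= p} for 0 < p < 1."""
    xs = sorted(sample)
    n = len(xs)
    for k in range(1, n + 1):
        # The first order statistic X_(k) with k/n >= p attains the infimum
        if k / n >= p:
            return xs[k - 1]
    return xs[-1]
```

For instance, `plug_in_mean(sample, lambda x: x ** 2)` gives the second sample moment, and `sample_quantile(sample, 0.5)` gives the sample median under this definition of the quantile.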


In this section, we study asymptotic distributions of $T(F_n)$. We focus
on the case of real-valued $T$ ($s = 1$), since the extension to the case of $s \ge 2$
is straightforward.


5.2.1 Differentiability and asymptotic normality 


Note that $T(F_n)$ is a function of the "statistic" $F_n$. In Theorem 1.12 (and
§3.5.1) we have studied how to use Taylor's expansion to establish asymptotic
normality of differentiable functions of statistics that are asymptotically
normal. This leads to the approach of establishing asymptotic normality
of $T(F_n)$ by using some generalized Taylor expansions for functionals
and using asymptotic properties of $F_n$ given in §5.1.1.

First, we need a suitably defined differential of $T$. Several versions of
differentials are given in the following definition.


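Definition 5.2. Let $T$ be a functional on $\mathcal{F}_0$, a collection of c.d.f.'s on $\mathcal{R}^d$,
and let $\mathcal{D} = \{c(G_1 - G_2) : c \in \mathcal{R}, G_j \in \mathcal{F}_0, j = 1, 2\}$.

(i) A functional $T$ on $\mathcal{F}_0$ is Gâteaux differentiable at $G \in \mathcal{F}_0$ if and only if
there is a linear functional $L_G$ on $\mathcal{D}$ (i.e., $L_G(c_1\Delta_1 + c_2\Delta_2) = c_1 L_G(\Delta_1) +
c_2 L_G(\Delta_2)$ for any $\Delta_j \in \mathcal{D}$ and $c_j \in \mathcal{R}$) such that $\Delta \in \mathcal{D}$ and $G + t\Delta \in \mathcal{F}_0$
imply

$$\lim_{t \to 0} \left[ \frac{T(G + t\Delta) - T(G)}{t} - L_G(\Delta) \right] = 0.$$

(ii) Let $\varrho$ be a distance on $\mathcal{F}_0$ induced by a norm $\|\cdot\|$ on $\mathcal{D}$. A functional
$T$ on $\mathcal{F}_0$ is $\varrho$-Hadamard differentiable at $G \in \mathcal{F}_0$ if and only if there is a
linear functional $L_G$ on $\mathcal{D}$ such that for any sequence of numbers $t_j \to 0$
and $\{\Delta, \Delta_j, j = 1, 2, ...\} \subset \mathcal{D}$ satisfying $\|\Delta_j - \Delta\| \to 0$ and $G + t_j\Delta_j \in \mathcal{F}_0$,

$$\lim_{j \to \infty} \left[ \frac{T(G + t_j\Delta_j) - T(G)}{t_j} - L_G(\Delta_j) \right] = 0.$$

(iii) Let $\varrho$ be a distance on $\mathcal{F}_0$. A functional $T$ on $\mathcal{F}_0$ is $\varrho$-Fréchet differentiable
at $G \in \mathcal{F}_0$ if and only if there is a linear functional $L_G$ on $\mathcal{D}$ such
that for any sequence $\{G_j\}$ satisfying $G_j \in \mathcal{F}_0$ and $\varrho(G_j, G) \to 0$,

$$\lim_{j \to \infty} \frac{T(G_j) - T(G) - L_G(G_j - G)}{\varrho(G_j, G)} = 0. \qquad \blacksquare$$

The functional $L_G$ is called the differential of $T$ at $G$. If we define
$h(t) = T(G + t\Delta)$, then the Gâteaux differentiability is equivalent to the
differentiability of the function $h(t)$ at $t = 0$, and $L_G(\Delta)$ is simply $h'(0)$. Let
$\delta_x$ denote the $d$-dimensional c.d.f. degenerated at the point $x$ and $\phi_G(x) =
L_G(\delta_x - G)$. Then $\phi_F(x)$ is called the influence function of $T$ at $F$, which
is an important tool in robust statistics (see Hampel (1974)).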
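
Because $L_G(\Delta) = h'(0)$ with $h(t) = T(G + t\Delta)$, the influence function $\phi_G(x) = L_G(\delta_x - G)$ can be approximated by a finite difference when $G$ is discrete. The sketch below represents a discrete c.d.f. by a dictionary mapping support points to masses, which is a representation chosen here for illustration; all names are hypothetical.

```python
def mix_with_point_mass(g, x, t):
    """Return G + t * (delta_x - G) = (1 - t) * G + t * delta_x."""
    out = {pt: (1.0 - t) * m for pt, m in g.items()}
    out[x] = out.get(x, 0.0) + t
    return out


def mean_functional(g):
    """T(G) = integral of x dG(x) for a discrete G."""
    return sum(pt * m for pt, m in g.items())


def variance_functional(g):
    """T(G) = variance of G for a discrete G."""
    mu = mean_functional(g)
    return sum(m * (pt - mu) ** 2 for pt, m in g.items())


def numerical_influence(T, g, x, t=1e-6):
    """Finite-difference approximation to h'(0) with Delta = delta_x - G."""
    return (T(mix_with_point_mass(g, x, t)) - T(g)) / t
```

For the mean functional the approximation returns $x - \int y\, dG(y)$ up to rounding, and for the variance functional it is close to $(x - \mu)^2 - \sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of $G$; the latter is unbounded in $x$, in contrast with the bounded influence functions discussed later in this section.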


If $T$ is Gâteaux differentiable at $F$, then we have the following expansion
(taking $t = n^{-1/2}$ and $\Delta = \sqrt{n}(F_n - F)$):

$$\sqrt{n}[T(F_n) - T(F)] = L_F(\sqrt{n}(F_n - F)) + R_n. \tag{5.38}$$

Since $L_F$ is linear,

$$L_F(\sqrt{n}(F_n - F)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \phi_F(X_i) \to_d N(0, \sigma_F^2) \tag{5.39}$$

by the CLT, provided that

$$E[\phi_F(X_1)] = 0 \quad \text{and} \quad \sigma_F^2 = E[\phi_F(X_1)]^2 < \infty \tag{5.40}$$

(which is usually true when $\phi_F$ is bounded or when $F$ has some finite
moments). By Slutsky's theorem and (5.39),

$$\sqrt{n}[T(F_n) - T(F)] \to_d N(0, \sigma_F^2) \tag{5.41}$$

if $R_n$ in (5.38) is $o_p(1)$.

Unfortunately, Gâteaux differentiability is too weak to be useful in establishing
$R_n = o_p(1)$ (or (5.41)). This is why we need other types of
differentiability. Hadamard differentiability, which is also referred to as
compact differentiability, is clearly stronger than Gâteaux differentiability
but weaker than Fréchet differentiability (exercise). For a given functional
$T$, we can first find $L_G$ by differentiating $h(t) = T(G + t\Delta)$ at $t = 0$ and then
check whether $T$ is $\varrho$-Hadamard (or $\varrho$-Fréchet) differentiable with a given
$\varrho$. The most commonly used distances on $\mathcal{F}_0$ are the sup-norm distance
$\varrho_\infty$ and the $L_p$ distance $\varrho_{L_p}$. Their corresponding norms are given by (5.3)
and (5.5), respectively.


Theorem 5.5. Let $X_1, ..., X_n$ be i.i.d. from a c.d.f. $F$ on $\mathcal{R}^d$.
(i) If $T$ is $\varrho_\infty$-Hadamard differentiable at $F$, then $R_n$ in (5.38) is $o_p(1)$.
(ii) If $T$ is $\varrho$-Fréchet differentiable at $F$ with a distance $\varrho$ satisfying

$$\sqrt{n}\varrho(F_n, F) = O_p(1), \tag{5.42}$$

then $R_n$ in (5.38) is $o_p(1)$.

(iii) In either (i) or (ii), if (5.40) is also satisfied, then (5.41) holds.
Proof. Part (iii) follows directly from (i) or (ii). The proof of (i) involves
some high-level mathematics and is omitted; see, for example, Fernholz
(1983). We now prove (ii). From Definition 5.2(iii), for any $\epsilon > 0$, there is
a $\delta > 0$ such that $|R_n| < \epsilon\sqrt{n}\varrho(F_n, F)$ whenever $\varrho(F_n, F) < \delta$. Then

$$P(|R_n| > \eta) \le P(\sqrt{n}\varrho(F_n, F) > \eta/\epsilon) + P(\varrho(F_n, F) \ge \delta)$$

for any $\eta > 0$, which implies

$$\limsup_n P(|R_n| > \eta) \le \limsup_n P(\sqrt{n}\varrho(F_n, F) > \eta/\epsilon).$$

The result follows from (5.42) and the fact that $\epsilon$ can be made arbitrarily
small. ∎


Since $\varrho$-Fréchet differentiability implies $\varrho$-Hadamard differentiability,
Theorem 5.5(ii) is useful when $\varrho$ is not the sup-norm distance. There
are functionals that are not $\varrho_\infty$-Hadamard differentiable (and hence not
$\varrho_\infty$-Fréchet differentiable). For example, if $d = 1$ and $T(G) = g(\int x dG)$
with a differentiable function $g$, then $T$ is not necessarily $\varrho_\infty$-Hadamard
differentiable, but is $\varrho_{L_1}$-Fréchet differentiable (exercise).

From Theorem 5.2, condition (5.42) holds for $\varrho_{L_p}$ under the moment
conditions on $F$ given in Theorem 5.2.

Note that if $\varrho$ and $\tilde\varrho$ are two distances on $\mathcal{F}_0$ satisfying $\tilde\varrho(G_1, G_2) \le
c\varrho(G_1, G_2)$ for a constant $c$ and all $G_j \in \mathcal{F}_0$, then $\tilde\varrho$-Hadamard (Fréchet)
differentiability implies $\varrho$-Hadamard (Fréchet) differentiability. This suggests
the use of the distance $\varrho_{\infty+p} = \varrho_\infty + \varrho_{L_p}$, which also satisfies (5.42)
under the moment conditions in Theorem 5.2. The distance $\varrho_{\infty+p}$ is useful
in some cases (Theorem 5.6).

A $\varrho_\infty$-Hadamard differentiable $T$ having a bounded and continuous influence
function $\phi_F$ is robust in Hampel's sense (see, e.g., Huber (1981)).
This is motivated by the fact that the asymptotic behavior of $T(F_n)$ is determined
by that of $L_F(F_n - F)$, and a small change in the sample, i.e.,
small changes in all $x_i$'s (rounding, grouping) or large changes in a few $x_i$'s
(gross errors, blunders), will result in a small change of $T(F_n)$ if and only
if $\phi_F$ is bounded and continuous.

We now consider some examples. For the sample moments related to
functionals of the form $T(G) = \int \psi(x) dG(x)$, it is clear that $T$ is a linear
functional. Any linear functional is trivially $\varrho$-Fréchet differentiable for any
$\varrho$. Next, if $F$ is one-dimensional and $F'(x) > 0$ for all $x$, then the quantile
functional $T(G) = G^{-1}(p)$ is $\varrho_\infty$-Hadamard differentiable at $F$ (Fernholz,
1983). Hence, Theorem 5.5 applies to these functionals. But the asymptotic
normality of sample quantiles can be established under weaker conditions,
which are studied in §5.3.1.


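Example 5.5 (Convolution functionals). Suppose that $\mathcal{F}$ is on $\mathcal{R}$ and for
a fixed $z \in \mathcal{R}$,

$$T(G) = \int G(z - y) dG(y), \quad G \in \mathcal{F}.$$

If $X_1$ and $X_2$ are i.i.d. with c.d.f. $G$, then $T(G)$ is the c.d.f. of $X_1 + X_2$
(Exercise 47 in §1.6), and is also called the convolution of $G$ evaluated at
$z$. For $t_j \to 0$ and $\|\Delta_j - \Delta\|_\infty \to 0$,

$$T(G + t_j\Delta_j) - T(G) = 2t_j \int \Delta_j(z - y) dG(y) + t_j^2 \int \Delta_j(z - y) d\Delta_j(y)$$

(for $\Delta = c_1 G_1 + c_2 G_2$, $G_j \in \mathcal{F}_0$, and $c_j \in \mathcal{R}$, $d\Delta$ denotes $c_1 dG_1 + c_2 dG_2$).
Using Lemma 5.2, one can show (exercise) that

$$\int \Delta_j(z - y) d\Delta_j(y) = O(1). \tag{5.43}$$

Hence $T$ is $\varrho_\infty$-Hadamard differentiable at any $G \in \mathcal{F}$ with $L_G(\Delta) =
2 \int \Delta(z - y) dG(y)$. The influence function, $\phi_F(x) = 2 \int (\delta_x - F)(z - y) dF(y)$,
is a bounded function and clearly satisfies (5.40). Thus, (5.41) holds. If $F$
is continuous, then $T$ is robust in Hampel's sense (exercise). ∎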
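
For the convolution functional, the plug-in estimator has a simple closed form: $T(F_n) = \int F_n(z - y) dF_n(y) = n^{-2} \sum_{i=1}^n \sum_{j=1}^n I(X_i + X_j \le z)$. A minimal sketch, with a hypothetical function name:

```python
def convolution_functional(sample, z):
    """T(F_n) = n^{-2} * #{(i, j) : X_i + X_j <= z}, pairs with i = j included."""
    n = len(sample)
    count = sum(1 for xi in sample for xj in sample if xi + xj <= z)
    return count / (n * n)
```

The double loop costs $O(n^2)$ operations, which is adequate for illustration; sorting the sample first would allow a faster count.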


Three important classes of statistical functionals, i.e., L-estimators,
M-estimators, and rank statistics and R-estimators, are considered in §5.2.2.


Lemma 5.2. Let $\Delta \in \mathcal{D}$ and $h$ be a continuous function on $\mathcal{R}$ such that
$\int h(x) d\Delta(x)$ is finite. Then

$$\left| \int h(x) d\Delta(x) \right| \le \|h\|_V \|\Delta\|_\infty,$$

where $\|h\|_V$ is the variation norm defined by

$$\|h\|_V = \lim_{a \to -\infty, b \to \infty} \sup \sum_{j=1}^m |h(x_j) - h(x_{j-1})|$$

with the supremum being taken over all partitions $a = x_0 < \cdots < x_m = b$
of the interval $[a, b]$. ∎

The proof of Lemma 5.2 can be found in Natanson (1961, p. 232).


The differentials in Definition 5.2 are first-order differentials. For some
functionals, we can also consider their second-order differentials, which provide
a way of defining the order of the asymptotic biases via expansion
(2.37).


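Definition 5.3. Let $T$ be a functional on $\mathcal{F}_0$ and $\varrho$ be a distance on $\mathcal{F}_0$.

(i) $T$ is second-order $\varrho$-Hadamard differentiable at $G \in \mathcal{F}_0$ if and only if
there is a functional $Q_G$ on $\mathcal{D}$ such that for any sequence of numbers $t_j \to 0$
and $\{\Delta, \Delta_j, j = 1, 2, ...\} \subset \mathcal{D}$ satisfying $\|\Delta_j - \Delta\| \to 0$ and $G + t_j\Delta_j \in \mathcal{F}_0$,

$$\lim_{j \to \infty} \frac{T(G + t_j\Delta_j) - T(G) - Q_G(t_j\Delta_j)}{t_j^2} = 0,$$

where $Q_G(\Delta) = \int\int \psi_G(x, y) d(G + \Delta)(x) d(G + \Delta)(y)$ for a function $\psi_G$
satisfying $\psi_G(x, y) = \psi_G(y, x)$, $\int\int \psi_G(x, y) dG(x) dG(y) = 0$, and $\mathcal{D}$ and
$\|\cdot\|$ are the same as those in Definition 5.2(ii).

(ii) $T$ is second-order $\varrho$-Fréchet differentiable at $G \in \mathcal{F}_0$ if and only if, for
any sequence $\{G_j\}$ satisfying $G_j \in \mathcal{F}_0$ and $\varrho(G_j, G) \to 0$,

$$\lim_{j \to \infty} \frac{T(G_j) - T(G) - Q_G(G_j - G)}{[\varrho(G_j, G)]^2} = 0,$$

where $Q_G$ is the same as that in (i). ∎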


For a second-order differentiable $T$, we have the following expansion:

$$n[T(F_n) - T(F)] = nV_n + R_n, \tag{5.44}$$

where

$$V_n = Q_F(F_n - F) = \int\int \psi_F(x, y) dF_n(x) dF_n(y) = \frac{1}{n^2} \sum_{j=1}^n \sum_{i=1}^n \psi_F(X_i, X_j)$$

is a "V-statistic" (§3.5.3) whose asymptotic properties are given by Theorem
3.16. If $R_n$ in (5.44) is $o_p(1)$, then the asymptotic behavior of
$T(F_n) - T(F)$ is the same as that of $V_n$.


Proposition 5.1. Let $X_1, ..., X_n$ be i.i.d. from $F$.

(i) If $T$ is second-order $\varrho_\infty$-Hadamard differentiable at $F$, then $R_n$ in (5.44)
is $o_p(1)$.

(ii) If $T$ is second-order $\varrho$-Fréchet differentiable at $F$ with a distance $\varrho$
satisfying (5.42), then $R_n$ in (5.44) is $o_p(1)$. ∎


Combining Proposition 5.1 with Theorem 3.16, we conclude that if

$$\zeta_1 = \mathrm{Var}\left( \int \psi_F(X_1, y) dF(y) \right) > 0,$$

then (5.41) holds with $\sigma_F^2 = 4\zeta_1$ and $\mathrm{amse}_{T(F_n)}(P) = \sigma_F^2/n$; if $\zeta_1 = 0$, then

$$n[T(F_n) - T(F)] \to_d \sum_{j=1}^\infty \lambda_j \chi_{1j}^2$$

and $\mathrm{amse}_{T(F_n)}(P) = \{2\mathrm{Var}(\psi_F(X_1, X_2)) + [E\psi_F(X_1, X_1)]^2\}/n^2$. In any
case, expansion (2.37) holds and the $n^{-1}$ order asymptotic bias of $T(F_n)$ is
$E\psi_F(X_1, X_1)/n$.

If $T$ is also first-order differentiable, then it can be shown (exercise) that

$$\phi_F(x) = 2 \int \psi_F(x, y) dF(y). \tag{5.45}$$

Then $\zeta_1 = 4^{-1}\mathrm{Var}(\phi_F(X_1))$ and $\zeta_1 = 0$ corresponds to the case of $\phi_F(x) \equiv
0$. However, second-order $\varrho$-Hadamard (Fréchet) differentiability does not
imply first-order $\varrho$-Hadamard (Fréchet) differentiability (exercise).


The technique in this section can be applied to non-i.i.d. $X_i$'s when the
c.d.f.'s of $X_i$'s are determined by an unknown c.d.f. $F$, provided that results
similar to (5.39) and (5.42) (with $F_n$ replaced by some other estimator $\hat F$)
can be established.


5.2.2 L-, M-, and R-estimators and rank statistics 


Three large classes of statistical functionals based on i.i.d. $X_i$'s are studied
in this section.


L-estimators


Let $J(t)$ be a Borel function on $[0, 1]$. An L-functional is defined as

$$T(G) = \int x J(G(x)) dG(x), \quad G \in \mathcal{F}_0, \tag{5.46}$$

where $\mathcal{F}_0$ contains all c.d.f.'s on $\mathcal{R}$ for which $T$ is well defined. For $X_1, ..., X_n$
i.i.d. from $F \in \mathcal{F}_0$, $T(F_n)$ is called an L-estimator of $T(F)$.


Example 5.6. The following are some examples of commonly used L-estimators.

(i) When $J \equiv 1$, $T(F_n) = \bar X$, the sample mean.

(ii) When $J(t) = 4t - 2$, $T(F_n)$ is proportional to Gini's mean difference.

(iii) When $J(t) = (\beta - \alpha)^{-1} I_{(\alpha,\beta)}(t)$ for some constants $\alpha < \beta$, $T(F_n)$ is
called the trimmed sample mean. ∎
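
When there are no ties, $F_n(X_{(i)}) = i/n$ and (5.46) with $G = F_n$ reduces to $T(F_n) = n^{-1} \sum_{i=1}^n X_{(i)} J(i/n)$. The following sketch evaluates this sum for a user-supplied $J$; the no-ties assumption and the function names are ours.

```python
def l_estimator(sample, J):
    """T(F_n) = (1/n) * sum_i X_(i) * J(i / n), assuming no ties in the sample."""
    xs = sorted(sample)
    n = len(xs)
    return sum(x * J(i / n) for i, x in enumerate(xs, start=1)) / n


def trimmed_mean_weight(alpha, beta):
    """J(t) = 1 / (beta - alpha) on the open interval (alpha, beta), else 0."""
    return lambda t: 1.0 / (beta - alpha) if alpha < t < beta else 0.0
```

Passing `lambda t: 1.0` as `J` returns the sample mean, `trimmed_mean_weight(alpha, beta)` gives the trimmed sample mean of Example 5.6(iii), and `lambda t: 4 * t - 2` gives the weight function of Example 5.6(ii).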


For an L-functional $T$, it can be shown (exercise) that

$$T(G) - T(F) = \int \phi_F(x) d(G - F)(x) + R(G, F), \tag{5.47}$$

where

$$\phi_F(x) = -\int_{-\infty}^\infty (\delta_x - F)(y) J(F(y)) dy, \tag{5.48}$$

$$R(G, F) = -\int_{-\infty}^\infty W_G(x)[G(x) - F(x)] dx,$$

and

$$W_G(x) = \begin{cases} [G(x) - F(x)]^{-1} \int_{F(x)}^{G(x)} J(t) dt - J(F(x)) & G(x) \ne F(x) \\ 0 & G(x) = F(x). \end{cases}$$

A sufficient condition for (5.40) in this case is that $J$ is bounded and $F$
has a finite variance (exercise). However, (5.40) is also satisfied if $\phi_F$ is
bounded. The differentiability of $T$ can be verified under some conditions
on $J$.


Theorem 5.6. Let $T$ be an L-functional defined by (5.46).

(i) Suppose that $J$ is bounded, $J(t) = 0$ when $t \in [0,\alpha] \cup [\beta,1]$ for some
constants $\alpha < \beta$, and that the set $D = \{x : J \text{ is discontinuous at } F(x)\}$
has Lebesgue measure 0. Then $T$ is $\varrho_\infty$-Fréchet differentiable at $F$ with the
influence function $\phi_F$ given by (5.48), and $\phi_F$ is bounded and continuous
and satisfies (5.40).

(ii) Suppose that $J$ is bounded, the set $D$ in (i) has Lebesgue measure 0,
and $J$ is continuous on $[0,\alpha] \cup [\beta,1]$ for some constants $\alpha < \beta$. Then $T$ is
$\varrho_{\infty+1}$-Fréchet differentiable at $F$.

(iii) Suppose that $|J(t) - J(s)| \le C|t - s|^{p-1}$, where $C > 0$ and $p > 1$ are
some constants. Then $T$ is $\varrho_{L_p}$-Fréchet differentiable at $F$.

(iv) If, in addition to the conditions in part (i), $J'$ is continuous on $[\alpha,\beta]$,
then $T$ is second-order $\varrho_\infty$-Fréchet differentiable at $F$ with
$$\psi_F(x,y) = \phi_F(x) + \phi_F(y) - \int (\delta_x - F)(z)(\delta_y - F)(z) J'(F(z))\,dz.$$

(v) Suppose that $J'$ is continuous on $[0,1]$. Then $T$ is second-order
$\varrho_{\infty+1}$-Fréchet differentiable at $F$ with the same $\psi_F$ given in (iv).


5.2. Statistical Functionals 345 


Proof. We prove (i)-(iii). The proofs for (iv) and (v) are similar and are
left to the reader.

(i) Let $G_j \in \mathcal{F}$ and $\varrho_\infty(G_j,F) \to 0$. Let $c$ and $d$ be two constants such that
$F(c) > \beta$ and $F(d) < \alpha$. Then, for sufficiently large $j$, $G_j(x) \in [0,\alpha] \cup [\beta,1]$
if $x > c$ or $x < d$. Hence, for sufficiently large $j$,
$$|R(G_j,F)| = \left|\int_d^c W_{G_j}(x)(G_j - F)(x)\,dx\right| \le \varrho_\infty(G_j,F)\int_d^c |W_{G_j}(x)|\,dx.$$

Since $J$ is continuous at $F(x)$ when $x \notin D$ and $D$ has Lebesgue measure
0, $W_{G_j}(x) \to 0$ a.e. Lebesgue. By the dominated convergence theorem,
$\int_d^c |W_{G_j}(x)|\,dx \to 0$. This proves that $T$ is $\varrho_\infty$-Fréchet differentiable. The
assertions on $\phi_F$ can be proved by noting that
$$\phi_F(x) = -\int_d^c (\delta_x - F)(y) J(F(y))\,dy.$$

(ii) From the proof of (i), we only need to show that
$$\left|\int_A W_{G_j}(x)(G_j - F)(x)\,dx\right| \Big/ \varrho_{\infty+1}(G_j,F) \to 0, \qquad (5.49)$$
where $A = \{x : F(x) \le \alpha \text{ or } F(x) > \beta\}$. The quantity on the left-hand
side of (5.49) is bounded by $\sup_{x \in A} |W_{G_j}(x)|$, which converges to 0 under
the continuity assumption of $J$ on $[0,\alpha] \cup [\beta,1]$. Hence (5.49) follows.

(iii) The result follows from
$$|R(G,F)| \le C\int |G(x) - F(x)|^p\,dx = O\big([\varrho_{L_p}(G,F)]^p\big)$$
and the fact that $p > 1$. ∎


An L-estimator with $J(t) = 0$ when $t \in [0,\alpha] \cup [\beta,1]$ is called a trimmed
L-estimator. Theorem 5.6(i) shows that trimmed L-estimators satisfy (5.41)
and are robust in Hampel's sense. In cases (ii) and (iii) of Theorem 5.6,
(5.41) holds if $\mathrm{Var}(X_1) < \infty$, but $T(F_n)$ may not be robust in Hampel's
sense. It can be shown (exercise) that one or several of (i)-(v) of Theorem
5.6 can be applied to each of the L-estimators in Example 5.6.


M-estimators 


Let $\rho(x,t)$ be a Borel function on $\mathcal{R}^d \times \mathcal{R}$ and $\Theta$ be an open subset of $\mathcal{R}$.
An M-functional is defined to be a solution of
$$\int \rho(x,T(G))\,dG(x) = \min_{t \in \Theta}\int \rho(x,t)\,dG(x), \quad G \in \mathcal{F}_0, \qquad (5.50)$$
where $\mathcal{F}_0$ contains all c.d.f.'s on $\mathcal{R}^d$ for which the integrals in (5.50) are well
defined. For $X_1,...,X_n$ i.i.d. from $F \in \mathcal{F}_0$, $T(F_n)$ is called an M-estimator
of $T(F)$. Assume that $\psi(x,t) = \partial\rho(x,t)/\partial t$ exists a.e. and
$$\lambda_G(t) = \int \psi(x,t)\,dG(x) = \frac{\partial}{\partial t}\int \rho(x,t)\,dG(x). \qquad (5.51)$$
Then $\lambda_G(T(G)) = 0$.


Example 5.7. The following are some examples of M-estimators.

(i) If $\rho(x,t) = (x - t)^2/2$, then $\psi(x,t) = t - x$; $T(G) = \int x\,dG(x)$ is the
mean functional; and $T(F_n) = \bar X$ is the sample mean.

(ii) If $\rho(x,t) = |x - t|^p/p$, where $p \in [1,2)$, then
$$\psi(x,t) = \begin{cases} |x - t|^{p-1} & x \le t \\ -|x - t|^{p-1} & x > t. \end{cases}$$
When $p = 1$, $T(F_n)$ is the sample median. When $1 < p < 2$, $T(F_n)$ is called
the $p$th least absolute deviations estimator or the minimum $L_p$ distance
estimator.

(iii) Let $\mathcal{F}_0 = \{f_\theta : \theta \in \Theta\}$ be a parametric family of p.d.f.'s with $\Theta \subset \mathcal{R}$
and $\rho(x,t) = -\log f_t(x)$. Then $T(F_n)$ is an MLE. This indicates that
M-estimators are extensions of MLE's in parametric models.

(iv) Let $C > 0$ be a constant. Huber (1964) considers
$$\rho(x,t) = \begin{cases} \frac{1}{2}(x - t)^2 & |x - t| \le C \\ \frac{1}{2}C^2 & |x - t| > C \end{cases}$$
with
$$\psi(x,t) = \begin{cases} t - x & |x - t| \le C \\ 0 & |x - t| > C. \end{cases}$$
The corresponding $T(F_n)$ is a type of trimmed sample mean.

(v) Let $C > 0$ be a constant. Huber (1964) considers
$$\rho(x,t) = \begin{cases} \frac{1}{2}(x - t)^2 & |x - t| \le C \\ C|x - t| - \frac{1}{2}C^2 & |x - t| > C \end{cases}$$
with
$$\psi(x,t) = \begin{cases} C & t - x > C \\ t - x & |x - t| \le C \\ -C & t - x < -C. \end{cases}$$
The corresponding $T(F_n)$ is a type of Winsorized sample mean.

(vi) Hampel (1974) considers $\psi(x,t) = \psi_0(t - x)$ with $\psi_0(s) = -\psi_0(-s)$
and
$$\psi_0(s) = \begin{cases} s & 0 \le s \le a \\ a & a < s \le b \\ a(c - s)/(c - b) & b < s \le c \\ 0 & s > c, \end{cases}$$
where $0 < a < b < c$ are constants. A smoothed version of $\psi_0$ is
$$\psi_1(s) = \begin{cases} \sin(as) & 0 \le s < \pi/a \\ 0 & s > \pi/a. \end{cases}$$
∎


For bounded and continuous $\psi$, the following result shows that $T$ is
$\varrho_\infty$-Hadamard differentiable with a bounded and continuous influence function
and, hence, $T(F_n)$ satisfies (5.41) and is robust in Hampel's sense.


Theorem 5.7. Let $T$ be an M-functional defined by (5.50). Assume that
$\psi$ is a bounded and continuous function on $\mathcal{R}^d \times \mathcal{R}$ and that $\lambda_F(t)$ is
continuously differentiable at $T(F)$ and $\lambda_F'(T(F)) \neq 0$. Then $T$ is
$\varrho_\infty$-Hadamard differentiable at $F$ with
$$\phi_F(x) = -\psi(x,T(F))/\lambda_F'(T(F)).$$


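Proof. Let $t_j \to 0$, $\Delta_j \in \mathcal{D}$, $\|\Delta_j - \Delta\|_\infty \to 0$, and $G_j = F + t_j\Delta_j \in \mathcal{F}$.
Since $\lambda_G(T(G)) = 0$,
$$|\lambda_F(T(G_j)) - \lambda_F(T(F))| = \left|t_j\int \psi(x,T(G_j))\,d\Delta_j(x)\right| \to 0$$
by $\|\Delta_j - \Delta\|_\infty \to 0$ and the boundedness of $\psi$. Note that $\lambda_F'(T(F)) \neq 0$.
Hence, the inverse of $\lambda_F(t)$ exists and is continuous in a neighborhood of
$0 = \lambda_F(T(F))$. Therefore,
$$T(G_j) - T(F) \to 0. \qquad (5.52)$$

Let $h_F(T(F)) = \lambda_F'(T(F))$, $h_F(t) = [\lambda_F(t) - \lambda_F(T(F))]/[t - T(F)]$ if $t \neq T(F)$,
$$R_{1j} = \int \psi(x,T(F))\,d\Delta_j(x)\left[\frac{1}{\lambda_F'(T(F))} - \frac{1}{h_F(T(G_j))}\right],$$
$$R_{2j} = \frac{1}{h_F(T(G_j))}\int [\psi(x,T(G_j)) - \psi(x,T(F))]\,d\Delta_j(x),$$
and
$$L_F(\Delta) = -\frac{1}{\lambda_F'(T(F))}\int \psi(x,T(F))\,d\Delta(x), \quad \Delta \in \mathcal{D}.$$
Then
$$T(G_j) - T(F) = L_F(t_j\Delta_j) + t_j(R_{1j} - R_{2j}).$$

By (5.52), $\|\Delta_j - \Delta\|_\infty \to 0$, and the boundedness of $\psi$, $R_{1j} \to 0$. The
result then follows from $R_{2j} \to 0$, which follows from $\|\Delta_j - \Delta\|_\infty \to 0$ and
the boundedness and continuity of $\psi$ (exercise). ∎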




Some $\psi$ functions in Example 5.7 satisfy the conditions in Theorem
5.7 (exercise). Under more conditions on $\psi$, it can be shown that an
M-functional is $\varrho_\infty$-Fréchet differentiable at $F$ (Clarke, 1986; Shao, 1993).
Some M-estimators that satisfy (5.41) but are not differentiable functionals
are studied in §5.4.
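
As a computational illustration of the definition, an M-estimator can be obtained numerically by solving $\sum_i \psi(X_i,t) = 0$ in $t$. The sketch below does this for the Winsorizing $\psi$ of Example 5.7(v). The function names `psi_huber` and `huber_location`, the tolerance, and the data are hypothetical choices made for this illustration, and bisection is only one convenient root-finder, not a method prescribed by the text; it works here because $t \mapsto \sum_i \psi(X_i,t)$ is continuous and nondecreasing.

```python
def psi_huber(x, t, C):
    """psi(x, t) of Example 5.7(v): t - x clipped to the interval [-C, C]."""
    return max(-C, min(C, t - x))


def huber_location(xs, C, tol=1e-10):
    """Approximate a root of s(t) = sum_i psi(x_i, t) by bisection (illustrative)."""
    def s(t):
        return sum(psi_huber(x, t, C) for x in xs)

    # s(min) <= 0 <= s(max), so a root lies between the sample extremes.
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if s(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical sample with one outlier
robust_estimate = huber_location(data, C=1.5)
# With C larger than the data range no clipping occurs and the root is the sample mean.
mean_like_estimate = huber_location(data, C=1000.0)
```

With the outlier clipped, the root stays near the bulk of the data, whereas a very large $C$ recovers the sample mean, in line with the boundedness of $\psi$ being what gives robustness.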


Rank statistics and R-estimators 


Assume that $X_1,...,X_n$ are i.i.d. from a c.d.f. $F$ on $\mathcal{R}$. The rank of $X_i$
among $X_1,...,X_n$, denoted by $R_i$, is defined to be the number of $X_j$'s
satisfying $X_j \le X_i$, $i = 1,...,n$. The rank of $|X_i|$ among $|X_1|,...,|X_n|$ is
similarly defined and denoted by $\tilde R_i$. A statistic that is a function of $R_i$'s
or $\tilde R_i$'s is called a rank statistic. For $G \in \mathcal{F}$, let
$$\tilde G(x) = G(x) - G((-x)-), \quad x > 0,$$
where $g(x-)$ denotes the left limit of the function $g$ at $x$. Define a functional
$T$ by
$$T(G) = \int_0^\infty J(\tilde G(x))\,dG(x), \quad G \in \mathcal{F}, \qquad (5.53)$$
where $J$ is a function on $[0,1]$ with a bounded derivative $J'$. Then
$$T(F_n) = \int_0^\infty J(\tilde F_n(x))\,dF_n(x) = \frac{1}{n}\sum_{i=1}^n J\!\left(\frac{\tilde R_i}{n}\right) I_{(0,\infty)}(X_i)$$
is a (one-sample) signed rank statistic. If $J(t) = t$, then $T(F_n)$ is the
well-known Wilcoxon signed rank test statistic (§6.5.1).


Statistics based on ranks (or signed ranks) are robust against changes in
values of $x_i$'s, but may not provide efficient inference procedures, since the
values of $x_i$'s are discarded after ranks (or signed ranks) are determined.


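It can be shown (exercise) that $T$ in (5.53) is $\varrho_\infty$-Hadamard differentiable
at $F$ with the differential
$$L_F(\Delta) = \int_0^\infty J'(\tilde F(x))\tilde\Delta(x)\,dF(x) + \int_0^\infty J(\tilde F(x))\,d\Delta(x), \qquad (5.54)$$
where $\Delta \in \mathcal{D}$ and $\tilde\Delta(x) = \Delta(x) - \Delta((-x)-)$.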

These results can be extended to the case where $X_1,...,X_n$ are i.i.d.
from a c.d.f. $F$ on $\mathcal{R}^2$. For any c.d.f. $G$ on $\mathcal{R}^2$, let $J$ be a function on $[0,1]$
with $J(1-t) = -J(t)$ and a bounded $J'$,
$$\bar G(y) = [G(y,\infty) + G(\infty,y)]/2, \quad y \in \mathcal{R},$$
and
$$T(G) = \int J(\bar G(y))\,dG(y,\infty). \qquad (5.55)$$

Let $X_i = (Y_i,Z_i)$, $R_i$ be the rank of $Y_i$, and $U_i$ be the number of $Z_j$'s
satisfying $Z_j \le Y_i$, $i = 1,...,n$. Then
$$T(F_n) = \int J(\bar F_n(y))\,dF_n(y,\infty) = \frac{1}{n}\sum_{i=1}^n J\!\left(\frac{R_i + U_i}{2n}\right)$$
is called a two-sample linear rank statistic. It can be shown (exercise) that
$T$ in (5.55) is $\varrho_\infty$-Hadamard differentiable at $F$ with the differential
$$L_F(\Delta) = \int J'(\bar F(y))\bar\Delta(y)\,dF(y,\infty) + \int J(\bar F(y))\,d\Delta(y,\infty), \qquad (5.56)$$
where $\bar\Delta(y) = [\Delta(y,\infty) + \Delta(\infty,y)]/2$.


Rank statistics (one-sample or two-sample) are asymptotically normal
and robust in Hampel's sense (exercise). These results are useful in testing
hypotheses (§6.5).

Let $F$ be a continuous c.d.f. on $\mathcal{R}$ symmetric about an unknown parameter
$\theta \in \mathcal{R}$. An estimator of $\theta$ closely related to a rank statistic can be
derived as follows. Let $X_i$ be i.i.d. from $F$ and $W_i = (X_i, 2t - X_i)$ with a
fixed $t \in \mathcal{R}$. The functional $T$ in (5.55) evaluated at the c.d.f. of $W_i$ is equal
to
$$\lambda_F(t) = \int J\!\left(\frac{F(x) + 1 - F(2t - x)}{2}\right)dF(x). \qquad (5.57)$$

If $J$ is strictly increasing and $F$ is strictly increasing in a neighborhood of
$\theta$, then $\lambda_F(t) = 0$ if and only if $t = \theta$ (exercise). For $G \in \mathcal{F}$, define $T(G)$ to
be a solution of
$$\int J\!\left(\frac{G(x) + 1 - G(2T(G) - x)}{2}\right)dG(x) = 0. \qquad (5.58)$$

$T(F_n)$ is called an R-estimator of $T(F) = \theta$. When $J(t) = t - \frac{1}{2}$ (which is
related to the Wilcoxon signed rank test), $T(F_n)$ is the well-known
Hodges-Lehmann estimator and is equal to any value between the two middle points
of the values $(X_i + X_j)/2$, $i = 1,...,n$, $j = 1,...,n$.
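
The description of the Hodges-Lehmann estimator as a middle value of the pairwise averages translates directly into a few lines of code. The sketch below is a hypothetical illustration (the name `hodges_lehmann` and the data are not from the text); it forms all $n^2$ averages $(X_i + X_j)/2$, $i,j = 1,...,n$, as stated above, and returns their median, which for an even count is the midpoint of the two middle points and hence one admissible value.

```python
from statistics import median


def hodges_lehmann(xs):
    """Median of all n^2 pairwise averages (X_i + X_j)/2, i, j = 1, ..., n."""
    averages = [(a + b) / 2.0 for a in xs for b in xs]
    return median(averages)


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical sample with one outlier
hl_estimate = hodges_lehmann(data)
```

The quadratic number of averages makes this direct approach suitable only for small samples; it is meant to mirror the definition, not to be efficient.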


Theorem 5.8. Let $T$ be the functional defined by (5.58). Suppose that
$F$ is continuous and symmetric about $\theta$, the derivatives $F'$ and $J'$ exist,
and $J'$ is bounded. Then $T$ is $\varrho_\infty$-Hadamard differentiable at $F$ with the
influence function
$$\phi_F(x) = \frac{J(F(x))}{\int J'(F(x))F'(x)\,dF(x)}.$$

Proof. Since $F$ is symmetric about $\theta$, $F(x) + F(2\theta - x) = 1$. Under
the assumed conditions, $\lambda_F(t)$ is continuous and $\int J'(F(x))F'(x)\,dF(x) =
-\lambda_F'(\theta) \neq 0$ (exercise). Hence, the inverse of $\lambda_F$ exists and is continuous
at $0 = \lambda_F(\theta)$. Suppose that $t_j \to 0$, $\Delta_j \in \mathcal{D}$, $\|\Delta_j - \Delta\|_\infty \to 0$, and
$G_j = F + t_j\Delta_j \in \mathcal{F}$. Then
$$\int [J(G_j(x,t)) - J(F(x,t))]\,dG_j(x) \to 0$$
uniformly in $t$, where $G(x,t) = [G(x) + 1 - G(2t - x)]/2$, and
$$\int J(F(x,t))\,d(G_j - F)(x) = \int (F - G_j)(x) J'(F(x,t))\,dF(x,t) \to 0$$
uniformly in $t$. Let $\lambda_G(t)$ be defined by (5.57) with $F$ replaced by $G$. Then
$$\lambda_{G_j}(t) - \lambda_F(t) \to 0$$
uniformly in $t$. Thus, $\lambda_F(T(G_j)) \to 0$, which implies
$$T(G_j) - T(F) \to 0. \qquad (5.59)$$


Let $\xi_G(t) = \int J(F(x,t))\,dG(x)$, $h_F(t) = [\lambda_F(t) - \lambda_F(\theta)]/(t - \theta)$ if $t \neq \theta$,
and $h_F(\theta) = \lambda_F'(\theta)$. Then $T(G_j) - T(F) - \int \phi_F(x)\,d(G_j - F)(x)$ is equal to
$$\xi_{G_j}(\theta)\left[\frac{1}{\lambda_F'(\theta)} - \frac{1}{h_F(T(G_j))}\right] + \frac{\lambda_F(T(G_j)) + \xi_{G_j}(\theta)}{h_F(T(G_j))}. \qquad (5.60)$$
Note that
$$\xi_{G_j}(\theta) = \int J(F(x))\,dG_j(x) = t_j\int J(F(x))\,d\Delta_j(x).$$

By (5.59), Lemma 5.2, and $\|\Delta_j - \Delta\|_\infty \to 0$, the first term in (5.60) is $o(t_j)$.
The second term in (5.60) is the sum of
$$-\frac{t_j}{h_F(T(G_j))}\int [J(F(x,T(G_j))) - J(F(x))]\,d\Delta_j(x) \qquad (5.61)$$
and
$$\frac{1}{h_F(T(G_j))}\int [J(F(x,T(G_j))) - J(G_j(x,T(G_j)))]\,dG_j(x). \qquad (5.62)$$

From the continuity of $J$ and $F$, the quantity in (5.61) is $o(t_j)$. Similarly,
the quantity in (5.62) is equal to
$$\frac{1}{h_F(T(G_j))}\int [J(F(x,T(G_j))) - J(G_j(x,T(G_j)))]\,dF(x) + o(t_j). \qquad (5.63)$$
From Taylor's expansion, (5.59), and $\|\Delta_j - \Delta\|_\infty \to 0$, the quantity in
(5.63) is equal to
$$-\frac{t_j}{h_F(T(G_j))}\int J'(F(x))\Delta(x,\theta)\,dF(x) + o(t_j). \qquad (5.64)$$

Since $J(1 - t) = -J(t)$, the integral in (5.64) is 0. This proves that the
second term in (5.60) is $o(t_j)$ and thus the result. ∎


It is clear that the influence function $\phi_F$ for an R-estimator is bounded
and continuous if $J$ and $F$ are continuous. Thus, R-estimators satisfy (5.41)
and are robust in Hampel's sense.


Example 5.8. Let $J(t) = t - \frac{1}{2}$. Then $T(F_n)$ is the Hodges-Lehmann
estimator. From Theorem 5.8, $\phi_F(x) = [F(x) - \frac{1}{2}]/\gamma$, where $\gamma = \int F'(x)\,dF(x)$.
Since $F(X_1)$ has a uniform distribution on $[0,1]$, $\phi_F(X_1)$ has mean 0 and
variance $(12\gamma^2)^{-1}$. Thus, $\sqrt{n}[T(F_n) - T(F)] \to_d N(0,(12\gamma^2)^{-1})$. ∎


5.3 Linear Functions of Order Statistics 


In this section, we study statistics that are linear functions of order
statistics $X_{(1)} \le \cdots \le X_{(n)}$ based on independent random variables $X_1,...,X_n$
(in §5.3.1 and §5.3.2, $X_1,...,X_n$ are assumed i.i.d.). Order statistics, first
introduced in Example 2.9, are usually sufficient and often complete (or
minimal sufficient) for nonparametric families (Examples 2.12 and 2.14).


L-estimators defined in §5.2.2 are in fact linear functions of order
statistics. If $T$ is given by (5.46), then
$$T(F_n) = \int x J(F_n(x))\,dF_n(x) = \frac{1}{n}\sum_{i=1}^n J\!\left(\frac{i}{n}\right)X_{(i)}, \qquad (5.65)$$
since $F_n(X_{(i)}) = i/n$, $i = 1,...,n$. If $J$ is a smooth function, such as those
given in Example 5.6 or those satisfying the conditions in Theorem 5.6, the
corresponding L-estimator is often called a smooth L-estimator.
Asymptotic properties of smooth L-estimators can be obtained using Theorem 5.6
and the results in §5.2.1. Results on L-estimators that are slightly different
from that in (5.65) can be found in Serfling (1980, Chapter 8).
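
The representation of an L-estimator as a weighted sum of order statistics, $n^{-1}\sum_i J(i/n)X_{(i)}$, is easy to evaluate directly. The following minimal sketch assumes distinct observations (so that $F_n(X_{(i)}) = i/n$); the function names and the data are hypothetical and serve only to illustrate the formula for two of the weight functions of Example 5.6.

```python
def l_estimator(xs, J):
    """Evaluate (1/n) * sum_i J(i/n) * X_(i), with X_(1) <= ... <= X_(n)."""
    ordered = sorted(xs)
    n = len(ordered)
    return sum(J(i / n) * x for i, x in enumerate(ordered, start=1)) / n


def J_mean(t):
    """J identically 1: the L-estimator is the sample mean."""
    return 1.0


def J_gini(t):
    """J(t) = 4t - 2: proportional to Gini's mean difference."""
    return 4.0 * t - 2.0


data = [2.0, 7.0, 1.0, 4.0]  # hypothetical sample
mean_value = l_estimator(data, J_mean)
gini_value = l_estimator(data, J_gini)
```

Passing the weight function $J$ as an argument keeps the same routine usable for any L-functional of the form (5.46).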

In §5.3.1, we consider another useful class of linear functions of order
statistics, the sample quantiles described in the beginning of §5.2. In §5.3.2,
we study robust linear functions of order statistics (in Hampel's sense)
and their relative efficiencies w.r.t. the sample mean $\bar X$, an efficient but
nonrobust estimator. In §5.3.3, extensions to linear models are discussed.


5.3.1 Sample quantiles 


Recall that $G^{-1}(p)$ is defined to be $\inf\{x : G(x) \ge p\}$ for any c.d.f. $G$ on
$\mathcal{R}$, where $p \in (0,1)$ is a fixed constant. For i.i.d. $X_1,...,X_n$ from $F$, let
$\theta_p = F^{-1}(p)$ and $\hat\theta_p = F_n^{-1}(p)$ denote the $p$th quantile of $F$ and the $p$th
sample quantile, respectively. Then
$$\hat\theta_p = c_{np}X_{(m_p)} + (1 - c_{np})X_{(m_p+1)}, \qquad (5.66)$$
where $m_p$ is the integer part of $np$, $c_{np} = 1$ if $np$ is an integer, and $c_{np} = 0$
if $np$ is not an integer. Thus, $\hat\theta_p$ is a linear function of order statistics.
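
The rule for the $p$th sample quantile (take $X_{(np)}$ when $np$ is an integer and $X_{(m_p+1)}$ otherwise, with $m_p$ the integer part of $np$) can be sketched as follows. The function name and data are hypothetical; $p$ is passed as a `Fraction` so that the test "is $np$ an integer" is exact, which is a design choice of this sketch and not something the text prescribes.

```python
from fractions import Fraction


def sample_quantile(xs, p):
    """p-th sample quantile F_n^{-1}(p) for p in (0, 1), with p given as a Fraction."""
    ordered = sorted(xs)
    n_times_p = len(ordered) * p                      # exact rational arithmetic
    m = n_times_p.numerator // n_times_p.denominator  # integer part of np
    if n_times_p.denominator == 1:
        return ordered[m - 1]   # np is an integer: X_(m_p) (lists are 0-based)
    return ordered[m]           # otherwise: X_(m_p + 1)


odd_sample = [5.0, 1.0, 3.0, 2.0, 4.0]   # hypothetical data, n = 5
even_sample = [4.0, 1.0, 3.0, 2.0]       # hypothetical data, n = 4
median_odd = sample_quantile(odd_sample, Fraction(1, 2))
median_even = sample_quantile(even_sample, Fraction(1, 2))
```

Note that for even $n$ and $p = 1/2$ this definition returns the lower of the two middle order statistics, not their average.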


Note that $F(\theta_p-) \le p \le F(\theta_p)$ and $F$ is not flat in a neighborhood of
$\theta_p$ if and only if $p < F(\theta_p + \epsilon)$ for any $\epsilon > 0$.


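Theorem 5.9. Let $X_1,...,X_n$ be i.i.d. random variables from a c.d.f. $F$
satisfying $p < F(\theta_p + \epsilon)$ for any $\epsilon > 0$. Then, for every $\epsilon > 0$ and $n = 1,2,...$,
$$P(|\hat\theta_p - \theta_p| > \epsilon) \le 2Ce^{-2n\delta_\epsilon^2}, \qquad (5.67)$$
where $\delta_\epsilon$ is the smaller of $F(\theta_p + \epsilon) - p$ and $p - F(\theta_p - \epsilon)$ and $C$ is the same
constant in Lemma 5.1(i).
Proof. Let $\epsilon > 0$ be fixed. Note that $G(x) \ge t$ if and only if $x \ge G^{-1}(t)$
for any c.d.f. $G$ on $\mathcal{R}$ (exercise). Hence
$$P(\hat\theta_p > \theta_p + \epsilon) = P(p > F_n(\theta_p + \epsilon))$$
$$= P(F(\theta_p + \epsilon) - F_n(\theta_p + \epsilon) > F(\theta_p + \epsilon) - p)$$
$$\le P(\varrho_\infty(F_n,F) > \delta_\epsilon)$$
$$\le Ce^{-2n\delta_\epsilon^2},$$
where the last inequality follows from DKW's inequality (Lemma 5.1(i)).
Similarly,
$$P(\hat\theta_p < \theta_p - \epsilon) \le Ce^{-2n\delta_\epsilon^2}.$$
This proves (5.67). ∎


Result (5.67) implies that $\hat\theta_p$ is strongly consistent for $\theta_p$ (exercise) and
that $\hat\theta_p$ is $\sqrt{n}$-consistent for $\theta_p$ if $F'(\theta_p-)$ and $F'(\theta_p+)$ (the left and right
derivatives of $F$ at $\theta_p$) exist (exercise).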


The exact distribution of $\hat\theta_p$ can be obtained as follows. Since $nF_n(t)$
has the binomial distribution $Bi(F(t),n)$ for any $t \in \mathcal{R}$,
$$P(\hat\theta_p \le t) = P(F_n(t) \ge p) = \sum_{i=l_p}^{n}\binom{n}{i}[F(t)]^i[1 - F(t)]^{n-i}, \qquad (5.68)$$
where $l_p = np$ if $np$ is an integer and $l_p = 1 +$ the integer part of $np$ if $np$
is not an integer. If $F$ has a Lebesgue p.d.f. $f$, then $\hat\theta_p$ has the Lebesgue
p.d.f.
$$\varphi_n(t) = n\binom{n-1}{l_p-1}[F(t)]^{l_p-1}[1 - F(t)]^{n-l_p}f(t). \qquad (5.69)$$
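
Because the exact c.d.f. of the sample quantile is a binomial tail probability, it can be evaluated with a short routine. The sketch below is a hypothetical illustration: it takes the value $F(t)$ as input (so no particular $F$ is assumed), uses $l_p = \lceil np \rceil$, which coincides with the definition of $l_p$ given above, and takes $p$ as a `Fraction` to keep $np$ exact.

```python
from fractions import Fraction
from math import ceil, comb


def sample_quantile_cdf(F_t, n, p):
    """P(theta_hat_p <= t) = sum_{i=l_p}^n C(n,i) F(t)^i (1-F(t))^(n-i), F_t = F(t)."""
    # ceil(np) equals np when np is an integer and 1 + (integer part of np) otherwise.
    l_p = ceil(n * p)
    return sum(comb(n, i) * F_t ** i * (1.0 - F_t) ** (n - i)
               for i in range(l_p, n + 1))


# Hypothetical check values: sample median evaluated at the true median (F(t) = 1/2).
prob_n3 = sample_quantile_cdf(0.5, 3, Fraction(1, 2))
prob_n4 = sample_quantile_cdf(0.5, 4, Fraction(1, 2))
```

For odd $n$ the sample median is exactly median-unbiased ($P = 1/2$ at the true median), while for even $n$ the lower-middle convention makes the probability exceed $1/2$.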




The following result provides an asymptotic distribution for $\sqrt{n}(\hat\theta_p - \theta_p)$.


Theorem 5.10. Let $X_1,...,X_n$ be i.i.d. random variables from $F$.

(i) If $F(\theta_p) = p$, then $P(\sqrt{n}(\hat\theta_p - \theta_p) \le 0) \to \Phi(0) = \frac{1}{2}$, where $\Phi$ is the
c.d.f. of the standard normal.

(ii) If $F$ is continuous at $\theta_p$ and there exists $F'(\theta_p-) > 0$, then
$$P(\sqrt{n}(\hat\theta_p - \theta_p) \le t) \to \Phi(t/\sigma_F^-), \quad t < 0,$$
where $\sigma_F^- = \sqrt{p(1-p)}/F'(\theta_p-)$.

(iii) If $F$ is continuous at $\theta_p$ and there exists $F'(\theta_p+) > 0$, then
$$P(\sqrt{n}(\hat\theta_p - \theta_p) \le t) \to \Phi(t/\sigma_F^+), \quad t > 0,$$
where $\sigma_F^+ = \sqrt{p(1-p)}/F'(\theta_p+)$.

(iv) If $F'(\theta_p)$ exists and is positive, then
$$\sqrt{n}(\hat\theta_p - \theta_p) \to_d N(0,\sigma_F^2), \qquad (5.70)$$
where $\sigma_F = \sqrt{p(1-p)}/F'(\theta_p)$.
Proof. The proof of (i) is left as an exercise. Part (iv) is a direct
consequence of (i)-(iii) and the proofs of (ii) and (iii) are similar. Thus, we only
give a proof for (iii).

Let $t > 0$, $p_{nt} = F(\theta_p + t\sigma_F^+ n^{-1/2})$, $c_{nt} = \sqrt{n}(p_{nt} - p)/\sqrt{p_{nt}(1 - p_{nt})}$,
and $Z_{nt} = [B_n(p_{nt}) - np_{nt}]/\sqrt{np_{nt}(1 - p_{nt})}$, where $B_n(q)$ denotes a random
variable having the binomial distribution $Bi(q,n)$. Then
$$P(\hat\theta_p \le \theta_p + t\sigma_F^+ n^{-1/2}) = P(p \le F_n(\theta_p + t\sigma_F^+ n^{-1/2})) = P(Z_{nt} \ge -c_{nt}).$$

Under the assumed conditions on $F$, $p_{nt} \to p$ and $c_{nt} \to t$. Hence, the
result follows from
$$P(Z_{nt} < -c_{nt}) - \Phi(-c_{nt}) \to 0.$$

But this follows from the CLT (Example 1.33) and Pólya's theorem
(Proposition 1.16). ∎


If both $F'(\theta_p-)$ and $F'(\theta_p+)$ exist and are positive, but $F'(\theta_p-) \neq
F'(\theta_p+)$, then the asymptotic distribution of $\sqrt{n}(\hat\theta_p - \theta_p)$ has the c.d.f.
$\Phi(t/\sigma_F^-)I_{(-\infty,0)}(t) + \Phi(t/\sigma_F^+)I_{[0,\infty)}(t)$, a mixture of two normal
distributions. An example of such a case when $p = \frac{1}{2}$ is
$$F(x) = xI_{[0,\frac{1}{2})}(x) + (2x - \tfrac{1}{2})I_{[\frac{1}{2},\frac{3}{4})}(x) + I_{[\frac{3}{4},\infty)}(x).$$

When $F'(\theta_p-) = F'(\theta_p+) = F'(\theta_p) > 0$, (5.70) shows that the asymptotic
distribution of $\sqrt{n}(\hat\theta_p - \theta_p)$ is the same as that of $\sqrt{n}[F_n(\theta_p) - F(\theta_p)]/F'(\theta_p)$
(see (5.2)). The following result reveals a stronger relationship between
sample quantiles and the empirical c.d.f.


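Theorem 5.11 (Bahadur's representation). Let $X_1,...,X_n$ be i.i.d. random
variables from $F$. Suppose that $F'(\theta_p)$ exists and is positive. Then
$$\hat\theta_p = \theta_p + \frac{F(\theta_p) - F_n(\theta_p)}{F'(\theta_p)} + o_p\!\left(\frac{1}{\sqrt{n}}\right). \qquad (5.71)$$

Proof. Let $t \in \mathcal{R}$, $\theta_{nt} = \theta_p + tn^{-1/2}$, $Z_n(t) = \sqrt{n}[F(\theta_{nt}) - F_n(\theta_{nt})]/F'(\theta_p)$,
and $U_n(t) = \sqrt{n}[F(\theta_{nt}) - F_n(\hat\theta_p)]/F'(\theta_p)$. It can be shown (exercise) that
$$Z_n(t) - Z_n(0) = o_p(1). \qquad (5.72)$$
Note that $|p - F_n(\hat\theta_p)| \le n^{-1}$. Then
$$U_n(t) = \sqrt{n}[F(\theta_{nt}) - p + p - F_n(\hat\theta_p)]/F'(\theta_p)$$
$$= \sqrt{n}[F(\theta_{nt}) - p]/F'(\theta_p) + O(n^{-1/2})$$
$$\to t. \qquad (5.73)$$

Let $\xi_n = \sqrt{n}(\hat\theta_p - \theta_p)$. Then, for any $t \in \mathcal{R}$ and $\epsilon > 0$,
$$P(\xi_n \le t, Z_n(0) \ge t + \epsilon) = P(Z_n(t) \le U_n(t), Z_n(0) \ge t + \epsilon)$$
$$\le P(|Z_n(t) - Z_n(0)| \ge \epsilon/2) + P(|U_n(t) - t| \ge \epsilon/2)$$
$$\to 0 \qquad (5.74)$$
by (5.72) and (5.73). Similarly,
$$P(\xi_n \ge t + \epsilon, Z_n(0) \le t) \to 0. \qquad (5.75)$$

It follows from the result in Exercise 128 of §1.6 that
$$\xi_n - Z_n(0) = o_p(1),$$
which is the same as (5.71). ∎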


If $F$ has a positive Lebesgue p.d.f., then $\hat\theta_p$ viewed as a statistical
functional (§5.2) is $\varrho_\infty$-Hadamard differentiable at $F$ (Fernholz, 1983) with the
influence function
$$\phi_F(x) = [F(\theta_p) - I_{(-\infty,\theta_p]}(x)]/F'(\theta_p).$$

This implies result (5.71). Note that $\phi_F$ is bounded and is continuous
except when $x = \theta_p$.


Corollary 5.1. Let $X_1,...,X_n$ be i.i.d. random variables from $F$ having
positive derivatives at $\theta_{p_j}$, where $0 < p_1 < \cdots < p_m < 1$ are fixed constants.
Then
$$\sqrt{n}[(\hat\theta_{p_1},...,\hat\theta_{p_m}) - (\theta_{p_1},...,\theta_{p_m})] \to_d N_m(0,D),$$
where $D$ is the $m \times m$ symmetric matrix whose $(i,j)$th element is
$$p_i(1 - p_j)/[F'(\theta_{p_i})F'(\theta_{p_j})], \quad i \le j. \; ∎$$

The proof of this corollary is left to the reader.


Example 5.9 (Interquartile range). One application of Corollary 5.1 is the
derivation of the asymptotic distribution of the interquartile range $\hat\theta_{0.75} -
\hat\theta_{0.25}$. The interquartile range is used as a measure of the variability among
$X_i$'s. It can be shown (exercise) that
$$\sqrt{n}[(\hat\theta_{0.75} - \hat\theta_{0.25}) - (\theta_{0.75} - \theta_{0.25})] \to_d N(0,\sigma_F^2)$$
with
$$\sigma_F^2 = \frac{3}{16[F'(\theta_{0.75})]^2} + \frac{3}{16[F'(\theta_{0.25})]^2} - \frac{1}{8F'(\theta_{0.75})F'(\theta_{0.25})}. \; ∎$$
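
The variance of the interquartile range can be traced back to Corollary 5.1 in a few steps. The following worked calculation is only a sketch of one route to the exercise, with $m = 2$, $p_1 = 0.25$, $p_2 = 0.75$, and the shorthand $f_1 = F'(\theta_{0.25})$, $f_2 = F'(\theta_{0.75})$ introduced here for brevity; it applies the linear map $(u_1,u_2) \mapsto u_2 - u_1$ to the limiting $N_2(0,D)$ distribution.

```latex
\begin{align*}
D_{11} &= \frac{p_1(1-p_1)}{f_1^2} = \frac{3}{16 f_1^2}, \qquad
D_{22} = \frac{p_2(1-p_2)}{f_2^2} = \frac{3}{16 f_2^2}, \qquad
D_{12} = \frac{p_1(1-p_2)}{f_1 f_2} = \frac{1}{16 f_1 f_2}, \\
\sigma_F^2 &= D_{22} + D_{11} - 2D_{12}
 = \frac{3}{16 f_2^2} + \frac{3}{16 f_1^2} - \frac{1}{8 f_1 f_2}.
\end{align*}
```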


There are some applications of using extreme order statistics such as
$X_{(1)}$ and $X_{(n)}$. One example is given in Example 2.34. Some other examples
and references can be found in Serfling (1980, pp. 89-91).


5.3.2 Robustness and efficiency 


Let $F$ be a c.d.f. on $\mathcal{R}$ symmetric about $\theta \in \mathcal{R}$ with $F'(\theta) > 0$. Then
$\theta = \theta_{0.5}$ and is called the median of $F$. If $F$ has a finite mean, then $\theta$ is also
equal to the mean. In this section, we consider the estimation of $\theta$ based
on i.i.d. $X_i$'s from $F$.

If $F$ is normal, it has been shown in previous chapters that the sample
mean $\bar X$ is the UMVUE, MRIE, and MLE of $\theta$, and is asymptotically
efficient. On the other hand, if $F$ is the c.d.f. of the Cauchy distribution
$C(\theta,1)$, it follows from Exercise 78 in §1.6 that $\bar X$ has the same distribution
as $X_1$, i.e., $\bar X$ is as variable as $X_1$, and is inconsistent as an estimator of $\theta$.

Why does $\bar X$ perform so differently? An important difference between
the normal and Cauchy p.d.f.'s is that the former tends to 0 at the rate
$e^{-x^2/2}$ as $|x| \to \infty$, whereas the latter tends to 0 at the much slower rate
$x^{-2}$, which results in $\int |x|\,dF(x) = \infty$. The poor performance of $\bar X$ in the
Cauchy case is due to the high probability of getting extreme observations
and the fact that $\bar X$ is sensitive to large changes in a few of the $X_i$'s. (Note
that $\bar X$ is not robust in Hampel's sense, since the functional $\int x\,dG(x)$ has
an unbounded influence function at $F$.) This suggests the use of a robust
estimator that discards some extreme observations. The sample median,
which is defined to be the 50%th sample quantile $\hat\theta_{0.5}$ described in §5.3.1,
is insensitive to the behavior of $F$ as $|x| \to \infty$.


Since both the sample mean and the sample median can be used to
estimate $\theta$, a natural question is when is one better than the other, using
a criterion such as the amse. Unfortunately, a general answer does not
exist, since the asymptotic relative efficiency between these two estimators
depends on the unknown distribution $F$. If $F$ does not have a finite
variance, then $\mathrm{Var}(\bar X) = \infty$ and $\bar X$ may be inconsistent. In such a case the
sample median is certainly preferred, since $\hat\theta_{0.5}$ is consistent and
asymptotically normal as long as $F'(\theta) > 0$, and may have a finite variance (Exercise
60). The following example, which compares the sample mean and
median in some cases, shows that the sample median can be better even if
$\mathrm{Var}(X_1) < \infty$.


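Example 5.10. Suppose that $\mathrm{Var}(X_1) < \infty$. Then, by the CLT,
$$\sqrt{n}(\bar X - \theta) \to_d N(0,\mathrm{Var}(X_1)).$$
By Theorem 5.10(iv),
$$\sqrt{n}(\hat\theta_{0.5} - \theta) \to_d N(0,[2F'(\theta)]^{-2}).$$
Hence, the asymptotic relative efficiency of $\hat\theta_{0.5}$ w.r.t. $\bar X$ is
$$e(F) = 4[F'(\theta)]^2\mathrm{Var}(X_1).$$

(i) If $F$ is the c.d.f. of $N(\theta,\sigma^2)$, then $\mathrm{Var}(X_1) = \sigma^2$, $F'(\theta) = (\sqrt{2\pi}\sigma)^{-1}$,
and $e(F) = 2/\pi = 0.637$.

(ii) If $F$ is the c.d.f. of the logistic distribution $LG(\theta,\sigma)$, then $\mathrm{Var}(X_1) =
\sigma^2\pi^2/3$, $F'(\theta) = (4\sigma)^{-1}$, and $e(F) = \pi^2/12 = 0.822$.

(iii) If $F(x) = F_0(x - \theta)$ and $F_0$ is the c.d.f. of the t-distribution $t_\nu$ with
$\nu \ge 3$, then $\mathrm{Var}(X_1) = \nu/(\nu - 2)$, $F'(\theta) = \Gamma(\frac{\nu+1}{2})/[\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})]$, $e(F) = 1.62$
when $\nu = 3$, $e(F) = 1.12$ when $\nu = 4$, and $e(F) = 0.96$ when $\nu = 5$.

(iv) If $F$ is the c.d.f. of the double exponential distribution $DE(\theta,\sigma)$, then
$F'(\theta) = (2\sigma)^{-1}$ and $e(F) = 2$.

(v) Consider the Tukey model
$$F(x) = (1 - \epsilon)\Phi\!\left(\frac{x - \theta}{\sigma}\right) + \epsilon\,\Phi\!\left(\frac{x - \theta}{\tau\sigma}\right), \qquad (5.76)$$
where $\sigma > 0$, $\tau > 0$, and $0 < \epsilon < 1$. Then $\mathrm{Var}(X_1) = (1 - \epsilon)\sigma^2 + \epsilon\tau^2\sigma^2$,
$F'(\theta) = (1 - \epsilon + \epsilon/\tau)/(\sqrt{2\pi}\sigma)$, and $e(F) = 2(1 - \epsilon + \epsilon\tau^2)(1 - \epsilon + \epsilon/\tau)^2/\pi$.
Note that $\lim_{\epsilon \to 0} e(F) = 2/\pi$ and $\lim_{\tau \to \infty} e(F) = \infty$. ∎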
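
The efficiencies quoted in this example all come from the single formula $e(F) = 4[F'(\theta)]^2\mathrm{Var}(X_1)$, so they can be reproduced numerically. The sketch below is a hypothetical illustration with function names chosen here; the densities at $\theta$ and the variances are those stated in the example, except that the double exponential variance $2\sigma^2$ is a standard fact supplied by this sketch.

```python
from math import gamma, pi, sqrt


def are_median_vs_mean(density_at_theta, variance):
    """e(F) = 4 [F'(theta)]^2 Var(X_1): efficiency of the median w.r.t. the mean."""
    return 4.0 * density_at_theta ** 2 * variance


def are_normal(sigma=1.0):
    return are_median_vs_mean(1.0 / (sqrt(2.0 * pi) * sigma), sigma ** 2)


def are_logistic(sigma=1.0):
    return are_median_vs_mean(1.0 / (4.0 * sigma), sigma ** 2 * pi ** 2 / 3.0)


def are_t(nu):
    """t-distribution with nu >= 3 degrees of freedom (finite variance)."""
    density = gamma((nu + 1) / 2.0) / (sqrt(nu * pi) * gamma(nu / 2.0))
    return are_median_vs_mean(density, nu / (nu - 2.0))


def are_double_exponential(sigma=1.0):
    # Assumption of this sketch: Var(X_1) = 2 sigma^2 for DE(theta, sigma).
    return are_median_vs_mean(1.0 / (2.0 * sigma), 2.0 * sigma ** 2)


def are_tukey(eps, tau, sigma=1.0):
    """Tukey model (5.76): mixture of N(theta, sigma^2) and N(theta, tau^2 sigma^2)."""
    variance = (1.0 - eps) * sigma ** 2 + eps * tau ** 2 * sigma ** 2
    density = (1.0 - eps + eps / tau) / (sqrt(2.0 * pi) * sigma)
    return are_median_vs_mean(density, variance)


efficiencies = {
    "normal": are_normal(),
    "logistic": are_logistic(),
    "t3": are_t(3),
    "t4": are_t(4),
    "t5": are_t(5),
    "double exponential": are_double_exponential(),
}
```

Since $e(F)$ does not depend on $\sigma$ in any of these families, the default $\sigma = 1$ loses no generality.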


Since the sample median uses at most two actual values of $x_i$'s, it may
go too far in discarding observations, which results in a possible loss of
efficiency. The trimmed sample mean introduced in Example 5.6(iii) is a
natural compromise between the sample mean and median. Since $F$ is
symmetric, we consider $\beta = 1 - \alpha$ in the trimmed mean, which results in
the following L-estimator:
$$\bar X_\alpha = \frac{1}{(1 - 2\alpha)n}\sum_{j=m_\alpha+1}^{n-m_\alpha} X_{(j)}, \qquad (5.77)$$
where $m_\alpha$ is the integer part of $n\alpha$ and $\alpha \in (0,\frac{1}{2})$. The estimator in (5.77)
is called the $\alpha$-trimmed sample mean. It discards the $m_\alpha$ smallest and $m_\alpha$
largest observations. The sample mean and median can be viewed as two
extreme cases of $\bar X_\alpha$ as $\alpha \to 0$ and $\frac{1}{2}$, respectively.
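
A minimal sketch of the $\alpha$-trimmed sample mean follows. It assumes the normalization $1/((1 - 2\alpha)n)$ applied to the sum of $X_{(m_\alpha+1)},...,X_{(n-m_\alpha)}$, with $m_\alpha$ the integer part of $n\alpha$; the function name and data are hypothetical, and $\alpha$ is passed as a `Fraction` so that the integer part of $n\alpha$ is computed exactly.

```python
from fractions import Fraction


def trimmed_mean(xs, alpha):
    """alpha-trimmed sample mean for alpha in (0, 1/2), with alpha given as a Fraction."""
    ordered = sorted(xs)
    n = len(ordered)
    m = (n * alpha.numerator) // alpha.denominator   # integer part of n * alpha
    kept = ordered[m:n - m]                          # drop m smallest and m largest
    return sum(kept) / float((1 - 2 * alpha) * n)


data = [1.0, 2.0, 3.0, 4.0, 100.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # hypothetical sample
trimmed = trimmed_mean(data, Fraction(1, 10))
plain_mean = sum(data) / len(data)
```

When $n\alpha$ is an integer, the divisor $(1 - 2\alpha)n$ equals the number of retained observations, so the estimator is an ordinary average of the middle order statistics; here the single outlier is discarded and the trimmed value stays near the bulk of the data.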


It follows from Theorem 5.6 that if $F(x) = F_0(x - \theta)$, where $F_0$ is
symmetric about 0 and has a Lebesgue p.d.f. positive in the range of $X_1$,
then
$$\sqrt{n}(\bar X_\alpha - \theta) \to_d N(0,\sigma_\alpha^2), \qquad (5.78)$$
where
$$\sigma_\alpha^2 = \frac{2}{(1 - 2\alpha)^2}\left\{\int_0^{F_0^{-1}(1-\alpha)} x^2\,dF_0(x) + \alpha[F_0^{-1}(1 - \alpha)]^2\right\}.$$

Lehmann (1983, §5.4) provides various values of the asymptotic relative
efficiency $e_{\bar X_\alpha,\bar X}(F) = \mathrm{Var}(X_1)/\sigma_\alpha^2$. For instance, when $F(x) = F_0(x - \theta)$
and $F_0$ is the c.d.f. of the t-distribution $t_3$, $e_{\bar X_\alpha,\bar X}(F) = 1.70$, 1.91, and 1.97
for $\alpha = 0.05$, 0.125, and 0.25, respectively; when $F$ is given by (5.76) with
$\tau = 3$ and $\epsilon = 0.05$, $e_{\bar X_\alpha,\bar X}(F) = 1.20$, 1.19, and 1.09 for $\alpha = 0.05$, 0.125,
and 0.25, respectively; when $F$ is given by (5.76) with $\tau = 3$ and $\epsilon = 0.01$,
$e_{\bar X_\alpha,\bar X}(F) = 1.04$, 0.98, and 0.89 for $\alpha = 0.05$, 0.125, and 0.25, respectively.

Robustness and efficiency of other L-estimators can be discussed
similarly. For an L-estimator $T(F_n)$ with $T$ given by (5.46), if the conditions in
one of (i)-(iii) of Theorem 5.6 are satisfied, then (5.41) holds with
$$\sigma_F^2 = \int\!\!\int J(F(x))J(F(y))[F(\min\{x,y\}) - F(x)F(y)]\,dx\,dy, \qquad (5.79)$$
provided that $\sigma_F^2 < \infty$ (exercise). If $F$ is symmetric about $\theta$, $J$ is symmetric
about $\frac{1}{2}$, and $\int_0^1 J(t)\,dt = 1$, then $T(F) = \theta$ (exercise) and, therefore, the
asymptotic relative efficiency of $T(F_n)$ w.r.t. $\bar X$ is $\mathrm{Var}(X_1)/\sigma_F^2$.




5.3.3 L-estimators in linear models 


In this section, we extend L-estimators to the following linear model:
$$X_i = \beta^\tau Z_i + \varepsilon_i, \quad i = 1,...,n, \qquad (5.80)$$
with i.i.d. $\varepsilon_i$'s having an unknown c.d.f. $F_0$ and a full rank $Z$ whose $i$th
row is the vector $Z_i$. Note that the c.d.f. of $X_i$ is $F_0(x - \beta^\tau Z_i)$. Instead of
assuming $E(\varepsilon_i) = 0$ (as we did in Chapter 3), we assume that
$$\int x J(F_0(x))\,dF_0(x) = 0, \qquad (5.81)$$
where $J$ is a Borel function on $[0,1]$ (the same as that in (5.46)). Note that
(5.81) may hold without any assumption on the existence of $E(\varepsilon_i)$. For
instance, (5.81) holds if $F_0$ is symmetric about 0, $J$ is symmetric about $\frac{1}{2}$,
and $\int_0^1 J(t)\,dt = 1$ (Exercise 69).

Since $X_i$'s are not identically distributed, the use of the order statistics
and the empirical c.d.f. based on $X_1,...,X_n$ may not be appropriate.
Instead, we consider the ordered values of residuals $r_i = X_i - Z_i^\tau\hat\beta$, $i = 1,...,n$,
and some empirical c.d.f.'s based on residuals, where $\hat\beta = (Z^\tau Z)^{-1}Z^\tau X$ is
the LSE of $\beta$ (§3.3.1).

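To illustrate the idea, let us start with the case where $\beta$ and $Z_i$ are
univariate. First, assume that $Z_i \ge 0$ for all $i$ (or $Z_i \le 0$ for all $i$). Let $\hat F_0$
be the c.d.f. putting mass $Z_i/\sum_{i=1}^n Z_i$ at $r_i$, $i = 1,...,n$. An L-estimator of
$\beta$ is defined to be
$$\hat\beta_L = \hat\beta + \int x J(\hat F_0(x))\,d\hat F_0(x)\,\sum_{i=1}^n Z_i \Big/ \sum_{i=1}^n Z_i^2.$$

When $J(t) = (1 - 2\alpha)^{-1}I_{(\alpha,1-\alpha)}(t)$ with an $\alpha \in (0,\frac{1}{2})$, $\hat\beta_L$ is similar to the
$\alpha$-trimmed sample mean in the i.i.d. case.


If not all $Z_i$'s have the same sign, we can define L-estimators as follows.
Let $Z_i^+ = \max\{Z_i,0\}$ and $Z_i^- = Z_i^+ - Z_i$. Let $\hat F_0^\pm$ be the c.d.f. putting
mass $Z_i^\pm/\sum_{i=1}^n Z_i^\pm$ at $r_i$, $i = 1,...,n$. An L-estimator of $\beta$ is defined to be
$$\hat\beta_L = \hat\beta + \int x J(\hat F_0^+(x))\,d\hat F_0^+(x)\,\sum_{i=1}^n Z_i^+ \Big/ \sum_{i=1}^n Z_i^2
 - \int x J(\hat F_0^-(x))\,d\hat F_0^-(x)\,\sum_{i=1}^n Z_i^- \Big/ \sum_{i=1}^n Z_i^2.$$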

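For a general $p$-vector $Z_i$, let $z_{ij}$ be the $j$th component of $Z_i$, $j = 1,...,p$.
Let $z_{ij}^+ = \max\{z_{ij},0\}$, $z_{ij}^- = z_{ij}^+ - z_{ij}$, and $\hat F_{0j}^\pm$ be the c.d.f. putting mass
$z_{ij}^\pm/\sum_{i=1}^n z_{ij}^\pm$ at $r_i$, $i = 1,...,n$. For any $j$, if $z_{ij} \ge 0$ for all $i$ (or $z_{ij} \le 0$ for
all $i$), then we set $\hat F_{0j}^- \equiv 0$ (or $\hat F_{0j}^+ \equiv 0$). An L-estimator of $\beta$ is defined to
be
$$\hat\beta_L = \hat\beta + (Z^\tau Z)^{-1}(A^+ - A^-), \qquad (5.82)$$
where
$$A^\pm = \left(\int x J(\hat F_{01}^\pm(x))\,d\hat F_{01}^\pm(x)\sum_{i=1}^n z_{i1}^\pm,\; ...,\; \int x J(\hat F_{0p}^\pm(x))\,d\hat F_{0p}^\pm(x)\sum_{i=1}^n z_{ip}^\pm\right).$$

Obviously, $\hat\beta_L$ in (5.82) reduces to the previously defined $\hat\beta_L$ when $\beta$ and
$Z_i$ are univariate.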


Theorem 5.12. Assume model (5.80) with i.i.d. $\varepsilon_i$'s from a c.d.f. $F_0$
satisfying (5.81) for a given $J$. Suppose that $F_0$ has a uniformly continuous,
positive, and bounded derivative on the range of $\varepsilon_1$. Suppose further that
the conditions on $Z_i$'s in Theorem 3.12 are satisfied.

(i) If the function $J$ is continuous on $(\alpha_1,\alpha_2)$ and equals 0 on $[0,\alpha_1] \cup [\alpha_2,1]$,
where $0 < \alpha_1 < \alpha_2 < 1$ are constants, then
$$\sigma_{F_0}^{-1}(Z^\tau Z)^{1/2}(\hat\beta_L - \beta) \to_d N_p(0,I_p), \qquad (5.83)$$
where $\sigma_{F_0}^2$ is given by (5.79) with $F = F_0$.
(ii) Result (5.83) also holds if $J'$ is bounded on $[0,1]$, $E|\varepsilon_1| < \infty$, and $\sigma_{F_0}^2$
is finite.


The proof of this theorem can be found in Bickel (1973). Robustness
and efficiency comparisons between the LSE $\hat\beta$ and L-estimators $\hat\beta_L$ can be
made in a way similar to those in §5.3.2.


5.4 Generalized Estimating Equations 


The method of generalized estimating equations (GEE) is a powerful and
general method of deriving point estimators, which includes many
previously described methods as special cases. In §5.4.1, we begin with a
description of this method and, to motivate the idea, we discuss its relationship
with other methods that have been studied. Consistency and asymptotic
normality of estimators derived from generalized estimating equations are
studied in §5.4.2 and §5.4.3.


Throughout this section, we assume that $X_1,...,X_n$ are independent
(not necessarily identically distributed) random vectors, where the
dimension of $X_i$ is $d_i$, $i = 1,...,n$ ($\sup_i d_i < \infty$), and that we are interested in
estimating $\theta$, a $k$-vector of unknown parameters related to the unknown
population.



5.4.1 The GEE method and its relationship with others 


The sample mean and, more generally, the LSE in linear models are
solutions of equations of the form
$$\sum_{i=1}^n (X_i - \gamma^\tau Z_i)Z_i = 0.$$

Also, MLE's (or RLE's) in §4.4 and, more generally, M-estimators in §5.2.2
are solutions to equations of the form
$$\sum_{i=1}^n \psi(X_i,\gamma) = 0.$$

This leads to the following general estimation method. Let $\Theta \subset \mathcal{R}^k$ be the
range of $\theta$, $\psi_i$ be a Borel function from $\mathcal{R}^{d_i} \times \Theta$ to $\mathcal{R}^k$, $i = 1,...,n$, and
$$s_n(\gamma) = \sum_{i=1}^n \psi_i(X_i,\gamma), \quad \gamma \in \Theta. \qquad (5.84)$$

If $\theta$ is estimated by $\hat\theta \in \Theta$ satisfying $s_n(\hat\theta) = 0$, then $\hat\theta$ is called a GEE
estimator. The equation $s_n(\gamma) = 0$ is called a GEE. Apparently, the LSE's,
RLE's, MQLE's, and M-estimators are special cases of GEE estimators.


Usually GEE's are chosen so that
$$E[s_n(\theta)] = \sum_{i=1}^n E[\psi_i(X_i,\theta)] = 0, \qquad (5.85)$$
where the expectation $E$ may be replaced by an asymptotic expectation
defined in §2.5.2 if the exact expectation does not exist. If this is true,
then $\hat\theta$ is motivated by the fact that $s_n(\hat\theta) = 0$ is a sample analogue of
$E[s_n(\theta)] = 0$.
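
As a small numerical illustration of a GEE, consider a linear model with one covariate and no intercept, for which the LSE solves $s_n(\gamma) = \sum_i (X_i - \gamma Z_i)Z_i = 0$ in closed form. The sketch below is hypothetical (function names and data are chosen here) and simply checks that the closed-form LSE is a root of $s_n$.

```python
def s_n(gamma, xs, zs):
    """Estimating function s_n(gamma) = sum_i (X_i - gamma * Z_i) * Z_i."""
    return sum((x - gamma * z) * z for x, z in zip(xs, zs))


def lse_through_origin(xs, zs):
    """Closed-form root of s_n: sum_i X_i Z_i / sum_i Z_i^2."""
    return sum(x * z for x, z in zip(xs, zs)) / sum(z * z for z in zs)


zs = [1.0, 2.0, 3.0]   # hypothetical covariate values
xs = [2.0, 4.0, 7.0]   # hypothetical responses
gamma_hat = lse_through_origin(xs, zs)
residual_of_gee = s_n(gamma_hat, xs, zs)  # zero up to rounding error
```

For estimating functions without a closed-form root, such as those built from the $\psi$ functions of Example 5.7, $s_n(\gamma) = 0$ would instead be solved numerically.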

To motivate the idea, let us study the relationship between the GEE 
method and other methods that have been introduced. 


M-estimators 


The M-estimators defined in §5.2.2 for univariate $\theta = T(F)$ in the i.i.d. case
are special cases of GEE estimators. Huber (1981) also considers regression
M-estimators in the linear model (5.80). A regression M-estimator of $\beta$ is
defined as a solution to the GEE
$$\sum_{i=1}^n \psi(X_i - \gamma^\tau Z_i)Z_i = 0,$$
where $\psi$ is one of the functions given in Example 5.7.




LSE’s in linear and nonlinear regression models 


Suppose that
$$X_i = f(Z_i,\theta) + \varepsilon_i, \quad i = 1,...,n, \qquad (5.86)$$
where $Z_i$'s are the same as those in (5.80), $\theta$ is an unknown $k$-vector of
parameters, $f$ is a known function, and $\varepsilon_i$'s are independent random
variables. Model (5.86) is the same as model (5.80) if $f$ is linear in $\theta$ and is
called a nonlinear regression model otherwise. Note that model (4.64) is a
special case of model (5.86). The LSE under model (5.86) is any point in
$\Theta$ minimizing $\sum_{i=1}^n [X_i - f(Z_i,\gamma)]^2$ over $\gamma \in \Theta$. If $f$ is differentiable, then
the LSE is a solution to the GEE
$$\sum_{i=1}^n [X_i - f(Z_i,\gamma)]\frac{\partial f(Z_i,\gamma)}{\partial\gamma} = 0.$$


i=1 


Quasi-likelihoods 


This is a continuation of the discussion of the quasi-likelihoods introduced 
in §4.4.3. Assume first that X;’s are univariate (d; = 1). If X;’s follow a 
GLM, i.e., X; has the p.d.f. in (4.55) and (4.57) holds, and if (4.58) holds, 
then the likelihood equation (4.59) can be written as 


> wT Hi Gy =; (5.87) 


where pi(y) = MY" Zi), Gily) = Oma(y)/O7, vil) = Var(Xi)/¢, and 
we have used the following fact: 


WH) = HYG O)G YO = (GY O/c"(b). 


Equation (5.87) is a quasi-likelihood equation if either X; does not have 
the p.d.f. in (4.55) or (4.58) does not hold. Note that this generalizes the 
discussion in §4.4.3. If X; does not have the p.d.f. in (4.55), then the 
problem is often nonparametric. Let s,,(7) be the left-hand side of (5.87). 
Then s,(y) = 0 is a GEE and E[s,,(3)] = 0 is satisfied as long as the first 
condition in (4.56), E(X;) = 44; (3), is satisfied. 

For general $d_i$'s, let $X_i = (X_{i1},...,X_{id_i})$, $i = 1,...,n$, where each $X_{it}$
satisfies (4.56) and (4.57), i.e.,

\[ E(X_{it}) = \mu(\eta_{it}) = g^{-1}(\beta^\tau Z_{it}) \quad\text{and}\quad \mathrm{Var}(X_{it}) = \phi_i\,\mu'(\eta_{it}), \]

and $Z_{it}$'s are $k$-vector values of covariates. In biostatistics and life-time
testing problems, components of $X_i$ are repeated measurements at different
times from subject $i$ and are called longitudinal data. Although $X_i$'s are
assumed independent, $X_{it}$'s are likely to be dependent for each $i$. Let $R_i$
be the $d_i \times d_i$ correlation matrix whose $(t,l)$th element is the correlation
coefficient between $X_{it}$ and $X_{il}$. Then

\[ \mathrm{Var}(X_i) = \phi_i\,[D_i(\beta)]^{1/2} R_i\, [D_i(\beta)]^{1/2}, \tag{5.88} \]

where $D_i(\gamma)$ is the $d_i \times d_i$ diagonal matrix with the $t$th diagonal element
$(g^{-1})'(\gamma^\tau Z_{it})$. If $R_i$'s in (5.88) are known, then an extension of (5.87) to
the multivariate $x_i$'s is

\[ \sum_{i=1}^n G_i(\gamma)\{[D_i(\gamma)]^{1/2} R_i\, [D_i(\gamma)]^{1/2}\}^{-1}[x_i - \mu_i(\gamma)] = 0, \tag{5.89} \]

where $\mu_i(\gamma) = (\mu(\psi(\gamma^\tau Z_{i1})),...,\mu(\psi(\gamma^\tau Z_{id_i})))$ and $G_i(\gamma) = \partial\mu_i(\gamma)/\partial\gamma$. In
most applications, $R_i$ is unknown and its form is hard to model. Let $\tilde R_i$ be a
known correlation matrix (called a working correlation matrix). Replacing
$R_i$ in (5.89) by $\tilde R_i$ leads to the quasi-likelihood equation

\[ \sum_{i=1}^n G_i(\gamma)\{[D_i(\gamma)]^{1/2} \tilde R_i\, [D_i(\gamma)]^{1/2}\}^{-1}[x_i - \mu_i(\gamma)] = 0. \tag{5.90} \]


For example, we may assume that the components of $X_i$ are independent
and take $\tilde R_i = I_{d_i}$. Although the working correlation matrix $\tilde R_i$ may not be
the same as the true unknown correlation matrix $R_i$, an MQLE obtained
from (5.90) is still consistent and asymptotically normal (§5.4.2 and §5.4.3).
Of course, MQLE's are asymptotically more efficient if $\tilde R_i$ is closer to $R_i$.
Even if $\tilde R_i = R_i$ and $\phi_i \equiv \phi$, (5.90) is still a quasi-likelihood equation, since
the covariance matrix of $X_i$ cannot determine the distribution of $X_i$ unless
$X_i$ is normal.


Since an $\tilde R_i$ closer to $R_i$ results in a better MQLE, sometimes it is
suggested to replace $\tilde R_i$ in (5.90) by $\hat R_i$, an estimator of $R_i$ (Liang and
Zeger, 1986). The resulting equation is called a pseudo-likelihood equation.
As long as $\max_{i \le n} \|\hat R_i - U_i\| \to_p 0$ as $n \to \infty$, where $\|A\| = \sqrt{\mathrm{tr}(A^\tau A)}$ for
a matrix $A$ and $U_i$ is a correlation matrix (not necessarily the same as $R_i$),
$i = 1,...,n$, MQLE's are consistent and asymptotically normal.


Empirical likelihoods 


The previous discussion shows that the GEE method coincides with the 
method of deriving M-estimators, LSE’s, MLE’s, or MQLE’s. The following 
discussion indicates that the GEE method is also closely related to the 
method of empirical likelihoods introduced in §5.1.4.


Assume that $X_i$'s are i.i.d. from a c.d.f. $F$ on $\mathcal R^d$ and $\psi_i = \psi$ for all $i$.
Then condition (5.85) reduces to $E[\psi(X_1,\theta)] = 0$. Hence, we can consider
the empirical likelihood

\[ \ell(G) = \prod_{i=1}^n P_G(\{x_i\}), \qquad G \in \mathcal F, \]

subject to

\[ p_i \ge 0, \qquad \sum_{i=1}^n p_i = 1, \qquad\text{and}\qquad \sum_{i=1}^n p_i\,\psi(x_i,\theta) = 0, \tag{5.91} \]

where $p_i = P_G(\{x_i\})$. However, in this case the dimension of the function
$\psi$ is the same as the dimension of the parameter $\theta$ and, hence, the last
equation in (5.91) does not impose any restriction on $p_i$'s. Then, it follows
from Theorem 5.3 that $(p_1,...,p_n) = (n^{-1},...,n^{-1})$ maximizes $\ell(G)$ for any
fixed $\theta$. Substituting $p_i = n^{-1}$ into the last equation in (5.91) leads to

\[ \frac{1}{n} \sum_{i=1}^n \psi(x_i,\theta) = 0. \]

That is, any MELE $\hat\theta$ of $\theta$ is a GEE estimator.
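The claim that equal weights maximize the product of the $p_i$'s can also be checked directly. The following is a short sketch of the standard argument via the arithmetic-geometric mean inequality; it is offered as an aside and is not the proof of Theorem 5.3.

```latex
\[
\prod_{i=1}^n p_i \;\le\; \Big(\frac{1}{n}\sum_{i=1}^n p_i\Big)^{n} = n^{-n}
\qquad \text{whenever } p_i \ge 0 \text{ and } \sum_{i=1}^n p_i = 1,
\]
with equality if and only if $p_1 = \cdots = p_n = n^{-1}$.
```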


5.4.2 Consistency of GEE estimators 


We now study under what conditions (besides (5.85)) GEE estimators are
consistent. For each $n$, let $\hat\theta_n$ be a GEE estimator, i.e., $s_n(\hat\theta_n) = 0$, where
$s_n(\gamma)$ is defined by (5.84).

First, Theorem 5.7 and its proof can be extended to multivariate T in a 
straightforward manner. Hence, we have the following result. 


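Proposition 5.2. Suppose that $X_1,...,X_n$ are i.i.d. from $F$ and $\psi_i \equiv \psi$,
a bounded and continuous function from $\mathcal R^d \times \Theta$ to $\mathcal R^k$. Let
$\Psi(t) = \int \psi(x,t)\,dF(x)$. Suppose that $\Psi(\theta) = 0$ and $\partial\Psi(t)/\partial t$ exists and is of full
rank at $t = \theta$. Then $\hat\theta_n \to_p \theta$. ∎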


For unbounded $\psi$ in the i.i.d. case, the following result and its proof can
be found in Qin and Lawless (1994).


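Proposition 5.3. Suppose that $X_1,...,X_n$ are i.i.d. from $F$ and $\psi_i \equiv \psi$.
Assume that $\varphi(x,\gamma) = \partial\psi(x,\gamma)/\partial\gamma$ exists in $N_\theta$, a neighborhood of $\theta$, and
is continuous at $\theta$; there is a function $h(x)$ such that $\sup_{\gamma \in N_\theta} \|\varphi(x,\gamma)\| \le h(x)$,
$\sup_{\gamma \in N_\theta} \|\psi(x,\gamma)\|^3 \le h(x)$, and $E[h(X_1)] < \infty$; $E[\varphi(X_1,\theta)]$ is of full
rank; $E\{\psi(X_1,\theta)[\psi(X_1,\theta)]^\tau\}$ is positive definite; and (5.85) holds. Then,
there exists a sequence of random vectors $\{\hat\theta_n\}$ such that

\[ P(s_n(\hat\theta_n) = 0) \to 1 \quad\text{and}\quad \hat\theta_n \to_p \theta. \tag{5.92} \]
∎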


Next, we consider non-i.i.d. $X_i$'s.


Proposition 5.4. Suppose that $X_1,...,X_n$ are independent and $\theta$ is univariate.
Assume that $\psi_i(x,\gamma)$ is real-valued and nonincreasing in $\gamma$ for all
$i$; there is a $\delta > 0$ such that $\sup_i E|\psi_i(X_i,\gamma)|^{1+\delta} < \infty$ for any $\gamma$ in $N_\theta$, a
neighborhood of $\theta$ (this condition can be replaced by $E|\psi(X_1,\gamma)| < \infty$ for
any $\gamma$ in $N_\theta$ when $X_i$'s are i.i.d. and $\psi_i \equiv \psi$); $\psi_i(x,\gamma)$ are continuous in
$N_\theta$; (5.85) holds; and

\[ \limsup_n E[\Psi_n(\theta + \epsilon)] < 0 < \liminf_n E[\Psi_n(\theta - \epsilon)] \tag{5.93} \]

for any $\epsilon > 0$, where $\Psi_n(\gamma) = n^{-1} s_n(\gamma)$. Then, there exists a sequence of
random variables $\{\hat\theta_n\}$ such that (5.92) holds. Furthermore, any sequence
$\{\hat\theta_n\}$ satisfying $s_n(\hat\theta_n) = 0$ satisfies (5.92).

Proof. Since $\psi_i$'s are nonincreasing, the functions $\Psi_n(\gamma)$ and $E[\Psi_n(\gamma)]$ are
nonincreasing. Let $\epsilon > 0$ be fixed so that $\theta \pm \epsilon \in N_\theta$. Under the assumed
conditions,

\[ \Psi_n(\theta \pm \epsilon) - E[\Psi_n(\theta \pm \epsilon)] \to_p 0 \]

(Theorem 1.14(ii)). By condition (5.93),

\[ P(\Psi_n(\theta + \epsilon) < 0 < \Psi_n(\theta - \epsilon)) \to 1. \]

The rest of the proof is left as an exercise. ∎
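The monotonicity used in Proposition 5.4 also suggests a simple way to compute a GEE estimator: a nonincreasing, continuous $s_n$ can be solved by bisection. The sketch below assumes a Huber-type score written as $\psi(x,\gamma) = \max\{-C, \min(C, x - \gamma)\}$, which is nonincreasing in $\gamma$; the tuning constant 1.345, the function names, and the choice of bisection are illustrative assumptions rather than anything prescribed by the text.

```python
def huber_psi(x, gamma, c=1.345):
    """Huber-type score, nonincreasing in gamma: x - gamma clipped to [-c, c]."""
    return max(-c, min(c, x - gamma))

def s_n(data, gamma, psi=huber_psi):
    """Estimating function s_n(gamma) = sum_i psi(x_i, gamma)."""
    return sum(psi(x, gamma) for x in data)

def solve_gee_bisection(data, psi=huber_psi, tol=1e-10):
    """Root of s_n by bisection; relies on s_n being nonincreasing and continuous."""
    lo, hi = min(data), max(data)      # s_n(lo) >= 0 >= s_n(hi) for this psi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s_n(data, mid, psi) > 0.0:
            lo = mid                   # root is to the right of mid
        else:
            hi = mid                   # root is at or to the left of mid
    return 0.5 * (lo + hi)
```

For the data `[0.1, -0.4, 0.3, 8.0, 0.2]` the root is 0.38625, much closer to the bulk of the data than the sample mean 1.64, since the outlier 8.0 contributes only the capped value $C$ to $s_n$.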


To establish the next result, we need the following lemma. First, we
need the following concept. A sequence of functions $\{g_i\}$ from $\mathcal R^k$ to $\mathcal R^k$
is called equicontinuous on an open set $O \subset \mathcal R^k$ if and only if, for any
$\epsilon > 0$, there is a $\delta_\epsilon > 0$ such that $\sup_i \|g_i(t) - g_i(s)\| < \epsilon$ whenever $t \in O$,
$s \in O$, and $\|t - s\| < \delta_\epsilon$. Since a continuous function on a compact
set is uniformly continuous, functions such as $g_i(\gamma) = g(t_i,\gamma)$ form an
equicontinuous sequence on $O$ if $t_i$'s vary in a compact set containing $O$
and $g(t,\gamma)$ is a continuous function in $(t,\gamma)$.


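Lemma 5.3. Suppose that $\Theta$ is a compact subset of $\mathcal R^k$. Let $h_i(X_i) =
\sup_{\gamma \in \Theta} \|\psi_i(X_i,\gamma)\|$, $i = 1,2,...$. Suppose that $\sup_i E|h_i(X_i)|^{1+\delta} < \infty$ and
$\sup_i E\|X_i\|^{\delta} < \infty$ for some $\delta > 0$ (this condition can be replaced by
$E|h(X_1)| < \infty$ when $X_i$'s are i.i.d. and $\psi_i \equiv \psi$). Suppose further that
for any $c > 0$ and sequence $\{x_i\}$ satisfying $\|x_i\| \le c$, the sequence of functions
$\{g_i(\gamma) = \psi_i(x_i,\gamma)\}$ is equicontinuous on any open subset of $\Theta$. Then

\[ \sup_{\gamma \in \Theta} \left\| \frac{1}{n} \sum_{i=1}^n \{\psi_i(X_i,\gamma) - E[\psi_i(X_i,\gamma)]\} \right\| \to_p 0. \]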


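Proof. Since we only need to consider components of $\psi_i$'s, without loss of
generality we can assume that $\psi_i$'s are functions from $\mathcal R^{d_i} \times \Theta$ to $\mathcal R$. For
any $c > 0$,

\[ \sup_n E\left[ \frac{1}{n} \sum_{i=1}^n h_i(X_i) I_{(c,\infty)}(\|X_i\|) \right] \le \sup_i E[h_i(X_i) I_{(c,\infty)}(\|X_i\|)]. \]

Let $c_0 = \sup_i E|h_i(X_i)|^{1+\delta}$ and $c_1 = \sup_i E\|X_i\|^{\delta}$. By Hölder's inequality,

\[ E[h_i(X_i) I_{(c,\infty)}(\|X_i\|)] \le \left[E|h_i(X_i)|^{1+\delta}\right]^{1/(1+\delta)} \left[P(\|X_i\| > c)\right]^{\delta/(1+\delta)}
\le c_0^{1/(1+\delta)} c_1^{\delta/(1+\delta)} c^{-\delta^2/(1+\delta)} \]

for all $i$. For $\epsilon > 0$ and $\tilde\epsilon > 0$, choose a $c$ such that
$c_0^{1/(1+\delta)} c_1^{\delta/(1+\delta)} c^{-\delta^2/(1+\delta)} < \epsilon\tilde\epsilon/4$. Then, for any $O \subset \Theta$, the probability

\[ P\left( \frac{1}{n} \sum_{i=1}^n \left\{ \sup_{\gamma \in O} \psi_i(X_i,\gamma) - \inf_{\gamma \in O} \psi_i(X_i,\gamma) \right\} I_{(c,\infty)}(\|X_i\|) > \frac{\epsilon}{2} \right) \tag{5.94} \]

is bounded by $\tilde\epsilon$ (exercise). From the equicontinuity of $\{\psi_i(x_i,\gamma)\}$, there is
a $\delta_\epsilon > 0$ such that

\[ \frac{1}{n} \sum_{i=1}^n \left\{ \sup_{\gamma \in O_\epsilon} \psi_i(X_i,\gamma) - \inf_{\gamma \in O_\epsilon} \psi_i(X_i,\gamma) \right\} I_{[0,c]}(\|X_i\|) < \frac{\epsilon}{2} \]

for sufficiently large $n$, where $O_\epsilon$ denotes any open ball in $\mathcal R^k$ with radius
less than $\delta_\epsilon$. These results, together with Theorem 1.14(ii) and the fact
that $\|\psi_i(X_i,\gamma)\| \le h_i(X_i)$, imply that

\[ P\left( \frac{1}{n} \sum_{i=1}^n \left\{ \sup_{\gamma \in O_\epsilon} \psi_i(X_i,\gamma) - E\left[ \inf_{\gamma \in O_\epsilon} \psi_i(X_i,\gamma) \right] \right\} > \epsilon \right) \to 0. \tag{5.95} \]

Let $H_n(\gamma) = n^{-1} \sum_{i=1}^n \{\psi_i(X_i,\gamma) - E[\psi_i(X_i,\gamma)]\}$. Then

\[ \sup_{\gamma \in O_\epsilon} H_n(\gamma) \le \frac{1}{n} \sum_{i=1}^n \left\{ \sup_{\gamma \in O_\epsilon} \psi_i(X_i,\gamma) - E\left[ \inf_{\gamma \in O_\epsilon} \psi_i(X_i,\gamma) \right] \right\}, \]

which with (5.95) implies that

\[ P(H_n(\gamma) > \epsilon \text{ for all } \gamma \in O_\epsilon) \le P\left( \sup_{\gamma \in O_\epsilon} H_n(\gamma) > \epsilon \right) \to 0. \]

Similarly we can show that

\[ P(H_n(\gamma) < -\epsilon \text{ for all } \gamma \in O_\epsilon) \to 0. \]

Since $\Theta$ is compact, there exist $m_\epsilon$ open balls $O_{\epsilon,j}$ such that $\Theta \subset \cup_j O_{\epsilon,j}$.
Then, the result follows from

\[ P\left( \sup_{\gamma \in \Theta} |H_n(\gamma)| > \epsilon \right) \le \sum_{j=1}^{m_\epsilon} P\left( \sup_{\gamma \in O_{\epsilon,j}} |H_n(\gamma)| > \epsilon \right) \to 0. \]
∎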


Example 5.11. Consider the quasi-likelihood equation (5.90). Let $\{\tilde R_i\}$
be a sequence of working correlation matrices and

\[ \psi_i(x_i,\gamma) = G_i(\gamma)\{[D_i(\gamma)]^{1/2} \tilde R_i\, [D_i(\gamma)]^{1/2}\}^{-1}[x_i - \mu_i(\gamma)]. \tag{5.96} \]

It can be shown (exercise) that $\psi_i$'s satisfy the conditions of Lemma 5.3 if
$\Theta$ is compact and $\sup_i \|Z_i\| < \infty$. ∎


Proposition 5.5. Assume (5.85) and the conditions in Lemma 5.3 (with $\Theta$
replaced by any compact subset of the parameter space). Suppose that the
functions $\Delta_n(\gamma) = E[n^{-1} s_n(\gamma)]$ have the property that $\lim_{n \to \infty} \Delta_n(\gamma) = 0$
if and only if $\gamma = \theta$. (If $\Delta_n$ converges to a function $\Delta$, then this condition
and (5.85) imply that $\Delta$ has a unique 0 at $\theta$.) Suppose that $\{\hat\theta_n\}$ is a
sequence of GEE estimators and that $\hat\theta_n = O_p(1)$. Then $\hat\theta_n \to_p \theta$.

Proof. First, assume that $\Theta$ is a compact subset of $\mathcal R^k$. From Lemma 5.3
and $s_n(\hat\theta_n) = 0$, $\Delta_n(\hat\theta_n) \to_p 0$. By Theorem 1.8(vi), there is a subsequence
$\{n_i\}$ such that

\[ \Delta_{n_i}(\hat\theta_{n_i}) \to_{a.s.} 0. \tag{5.97} \]

Let $x_1, x_2, ...$ be a fixed sequence such that (5.97) holds and let $\theta_0$ be a
limit point of $\{\hat\theta_{n_i}\}$. Since $\Theta$ is compact, $\theta_0 \in \Theta$ and there is a subsequence
$\{m_j\} \subset \{n_i\}$ such that $\hat\theta_{m_j} \to \theta_0$. Using the argument in the proof of
Lemma 5.3, it can be shown (exercise) that $\{\Delta_n(\gamma)\}$ is equicontinuous on
any open subset of $\Theta$. Then

\[ \Delta_{m_j}(\hat\theta_{m_j}) - \Delta_{m_j}(\theta_0) \to 0, \]

which with (5.97) implies $\Delta_{m_j}(\theta_0) \to 0$. Under the assumed condition,
$\theta_0 = \theta$. Since this is true for any limit point of $\{\hat\theta_n\}$, $\hat\theta_n \to_p \theta$.

Next, consider a general $\Theta$. For any $\epsilon > 0$, there is an $M_\epsilon > 0$ such
that $P(\|\hat\theta_n\| \le M_\epsilon) > 1 - \epsilon$. The result follows from the previous proof by
considering the closure of $\Theta \cap \{\gamma: \|\gamma\| \le M_\epsilon\}$ as the parameter space. ∎


Condition $\hat\theta_n = O_p(1)$ in Proposition 5.5 is obviously necessary for the
consistency of $\hat\theta_n$. It has to be checked in any particular problem.

If a GEE is a likelihood equation under some conditions, then we can 
often show, using an argument similar to the proof of Theorem 4.17 or 4.18, 
that there exists a consistent sequence of GEE estimators. 


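Proposition 5.6. Suppose that $s_n(\gamma) = \partial \log \ell_n(\gamma)/\partial\gamma$ for some function
$\ell_n$; $D_n(\theta) = \mathrm{Var}(s_n(\theta)) \to 0$; $\varphi_i(x,\gamma) = \partial\psi_i(x,\gamma)/\partial\gamma$ exists and the
sequence of functions $\{\varphi_{ij}, i = 1,2,...\}$ satisfies the conditions in Lemma
5.3 with $\Theta$ replaced by a compact neighborhood of $\theta$, where $\varphi_{ij}$ is the $j$th
row of $\varphi_i$, $j = 1,...,k$; $-\liminf_n [D_n(\theta)]^{1/2} E[\nabla s_n(\theta)] [D_n(\theta)]^{1/2}$ is positive
definite, where $\nabla s_n(\gamma) = \partial s_n(\gamma)/\partial\gamma$; and (5.85) holds. Then, there exists
a sequence of estimators $\{\hat\theta_n\}$ satisfying (5.92). ∎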


The proof of Proposition 5.6 is similar to that of Theorem 4.17 or The- 
orem 4.18 and is left as an exercise. 


Example 5.12. Consider the quasi-likelihood equation (5.90) with $\tilde R_i =
I_{d_i}$ for all $i$. Then the GEE is a likelihood equation under a GLM (§4.4.2)
assumption. It can be shown (exercise) that the conditions of Proposition
5.6 are satisfied if $\sup_i \|Z_i\| < \infty$. ∎


5.4.3 Asymptotic normality of GEE estimators 


Asymptotic normality of a consistent sequence of GEE estimators can be 
established under some conditions. We first consider the special case where 
$\theta$ is univariate and $X_1,...,X_n$ are i.i.d.


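Theorem 5.13. Let $X_1,...,X_n$ be i.i.d. from $F$, $\psi_i \equiv \psi$, and $\theta \in \mathcal R$.
Suppose that $\Psi(\gamma) = \int \psi(x,\gamma)\,dF(x) = 0$ if and only if $\gamma = \theta$, $\Psi'(\theta)$ exists
and $\Psi'(\theta) \ne 0$.

(i) Assume that $\psi(x,\gamma)$ is nonincreasing in $\gamma$ and that $\int [\psi(x,\gamma)]^2 dF(x)$
is finite for $\gamma$ in a neighborhood of $\theta$ and is continuous at $\theta$. Then, any
sequence of GEE estimators (M-estimators) $\{\hat\theta_n\}$ satisfies

\[ \sqrt{n}(\hat\theta_n - \theta) \to_d N(0, \sigma_F^2), \tag{5.98} \]

where

\[ \sigma_F^2 = \int [\psi(x,\theta)]^2 dF(x) \big/ [\Psi'(\theta)]^2. \]

(ii) Assume that $\int [\psi(x,\theta)]^2 dF(x) < \infty$, $\psi(x,\gamma)$ is continuous in $x$, and
$\lim_{\gamma \to \theta} \|\psi(\cdot,\gamma) - \psi(\cdot,\theta)\|_V = 0$, where $\|\cdot\|_V$ is the variation norm defined
in Lemma 5.2. Then, any consistent sequence of GEE estimators $\{\hat\theta_n\}$
satisfies (5.98).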
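Proof. (i) Let $\Psi_n(\gamma) = n^{-1} s_n(\gamma)$. Since $\Psi_n$ is nonincreasing,

\[ P(\Psi_n(t) < 0) \le P(\hat\theta_n \le t) \le P(\Psi_n(t) \le 0) \]

for any $t \in \mathcal R$. Then, (5.98) follows from

\[ \lim_{n \to \infty} P(\Psi_n(t_n) < 0) = \lim_{n \to \infty} P(\Psi_n(t_n) \le 0) = \Phi(t) \]

for all $t \in \mathcal R$, where $t_n = \theta + t\sigma_F n^{-1/2}$. Let $s_{t,n}^2 = \mathrm{Var}(\psi(X_1,t_n))$ and
$Y_{ni} = [\psi(X_i,t_n) - \Psi(t_n)]/s_{t,n}$. Then, it suffices to show that

\[ \lim_{n \to \infty} P\left( \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_{ni} \le -\frac{\sqrt{n}\,\Psi(t_n)}{s_{t,n}} \right) = \Phi(t) \]

for all $t$. Under the assumed conditions, $\sqrt{n}\,\Psi(t_n) \to \Psi'(\theta) t \sigma_F$ and
$s_{t,n} \to -\Psi'(\theta)\sigma_F$. Hence, it suffices to show that

\[ \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_{ni} \to_d N(0,1). \]

Note that $Y_{n1},...,Y_{nn}$ are i.i.d. random variables. Hence we can apply
Lindeberg's CLT (Theorem 1.15). In this case, Lindeberg's condition (1.92)
is implied by

\[ \lim_{n \to \infty} \int_{|\psi(x,t_n)| > \sqrt{n}\epsilon} [\psi(x,t_n)]^2 dF(x) = 0 \]

for any $\epsilon > 0$. For any $\eta > 0$, $\psi(x,\theta+\eta) \le \psi(x,t_n) \le \psi(x,\theta-\eta)$ for all $x$
and sufficiently large $n$. Let $u(x) = \max\{|\psi(x,\theta-\eta)|, |\psi(x,\theta+\eta)|\}$. Then

\[ \int_{|\psi(x,t_n)| > \sqrt{n}\epsilon} [\psi(x,t_n)]^2 dF(x) \le \int_{u(x) > \sqrt{n}\epsilon} [u(x)]^2 dF(x), \]

which converges to 0 since $\int [\psi(x,\gamma)]^2 dF(x)$ is finite for $\gamma$ in a neighborhood
of $\theta$. This proves (i).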

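(ii) Let $\phi_F(x) = -\psi(x,\theta)/\Psi'(\theta)$. Following the proof of Theorem 5.7, we
have

\[ \sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \phi_F(X_i) + R_{1n} - R_{2n}, \]

where

\[ R_{1n} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i,\theta) \left[ \frac{1}{\Psi'(\theta)} - \frac{1}{h_F(\hat\theta_n)} \right], \]

\[ R_{2n} = \frac{\sqrt{n}}{h_F(\hat\theta_n)} \int [\psi(x,\hat\theta_n) - \psi(x,\theta)]\, d(F_n - F)(x), \]

and $h_F$ is defined in the proof of Theorem 5.7 with $\Psi = \lambda_F$. By the CLT
and the consistency of $\hat\theta_n$, $R_{1n} = o_p(1)$. Hence, the result follows if we can
show that $R_{2n} = o_p(1)$. By Lemma 5.2,

\[ |R_{2n}| \le \sqrt{n}\, |h_F(\hat\theta_n)|^{-1} \varrho_\infty(F_n, F)\, \|\psi(\cdot,\hat\theta_n) - \psi(\cdot,\theta)\|_V. \]

The result follows from the assumed condition on $\psi$ and the fact that
$\sqrt{n}\,\varrho_\infty(F_n, F) = O_p(1)$ (Theorem 5.1). ∎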


Note that the result in Theorem 5.13 coincides with the result in The- 
orem 5.7 and (5.41). 


Example 5.13. Consider the M-estimators given in Example 5.7 based
on i.i.d. random variables $X_1,...,X_n$. If $\psi$ is bounded and continuous, then
Theorem 5.7 applies and (5.98) holds. For case (ii), $\psi(x,\gamma)$ is not bounded
but is nondecreasing in $\gamma$ ($-\psi(x,\gamma)$ is nonincreasing in $\gamma$). Hence Theorem
5.13 can be applied to this case.

Consider Huber's $\psi$ given in Example 5.7(v). Assume that $F$ is continuous
at $\theta - C$ and $\theta + C$. Then

\[ \Psi(\gamma) = \int_{\gamma - C}^{\gamma + C} (\gamma - x)\,dF(x) + C F(\gamma - C) - C[1 - F(\gamma + C)] \]

is differentiable at $\theta$ (exercise); $\Psi(\theta) = 0$ if $F$ is symmetric about $\theta$ (exercise); and

\[ \int [\psi(x,\gamma)]^2 dF(x) = \int_{\gamma - C}^{\gamma + C} (\gamma - x)^2 dF(x) + C^2 F(\gamma - C) + C^2[1 - F(\gamma + C)] \]

is continuous at $\theta$ (exercise). Therefore, (5.98) holds with

\[ \sigma_F^2 = \frac{\int_{\theta - C}^{\theta + C} (\theta - x)^2 dF(x) + C^2 F(\theta - C) + C^2[1 - F(\theta + C)]}{[F(\theta + C) - F(\theta - C)]^2} \]

(exercise). Note that Huber's M-estimator is robust in Hampel's sense.
Asymptotic relative efficiency of $\hat\theta_n$ w.r.t. the sample mean $\bar X$ can be obtained
(exercise). ∎
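To make the formula for $\sigma_F^2$ concrete, the following sketch evaluates it when $F$ is the standard normal c.d.f. and $\theta = 0$. The normality assumption, the function names, and the use of the closed form $\int_{-C}^{C} x^2 dF(x) = F(C) - F(-C) - 2C f(C)$ (with $f$ the standard normal density, obtained by integration by parts) are choices made for this illustration only.

```python
import math

def norm_cdf(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def huber_asymptotic_variance(c):
    """sigma_F^2 in (5.98) for Huber's psi when F = N(0,1) and theta = 0."""
    mass = norm_cdf(c) - norm_cdf(-c)            # F(theta+C) - F(theta-C)
    middle = mass - 2.0 * c * norm_pdf(c)        # integral of x^2 dF over [-C, C]
    tails = c * c * norm_cdf(-c) + c * c * (1.0 - norm_cdf(c))
    return (middle + tails) / mass ** 2
```

Since the sample mean has asymptotic variance 1 in this setting, the asymptotic relative efficiency of Huber's estimator w.r.t. $\bar X$ under normality is `1 / huber_asymptotic_variance(c)`; it is close to 0.95 at `c = 1.345` and tends to 1 as `c` grows.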


The next result is for general $\theta$ and independent $X_i$'s.


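Theorem 5.14. Suppose that $\varphi_i(x,\gamma) = \partial\psi_i(x,\gamma)/\partial\gamma$ exists and the
sequence of functions $\{\varphi_{ij}, i = 1,2,...\}$ satisfies the conditions in Lemma
5.3 with $\Theta$ replaced by a compact neighborhood of $\theta$, where $\varphi_{ij}$ is the $j$th
row of $\varphi_i$; $\sup_i E\|\psi_i(X_i,\theta)\|^{2+\delta} < \infty$ for some $\delta > 0$ (this condition can be
replaced by $E\|\psi(X_1,\theta)\|^2 < \infty$ if $X_i$'s are i.i.d. and $\psi_i \equiv \psi$); $E[\psi_i(X_i,\theta)] = 0$;
$\liminf_n \lambda_-[n^{-1}\mathrm{Var}(s_n(\theta))] > 0$ and $\liminf_n \lambda_-[n^{-1}M_n(\theta)] > 0$, where
$M_n(\theta) = -E[\nabla s_n(\theta)]$ and $\lambda_-[A]$ is the smallest eigenvalue of the matrix
$A$. If $\{\hat\theta_n\}$ is a consistent sequence of GEE estimators, then

\[ V_n^{-1/2}(\hat\theta_n - \theta) \to_d N_k(0, I_k), \tag{5.99} \]

where

\[ V_n = [M_n(\theta)]^{-1}\, \mathrm{Var}(s_n(\theta))\, [M_n(\theta)]^{-1}. \tag{5.100} \]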


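Proof. The proof is similar to that of Theorem 4.17. By the consistency
of $\hat\theta_n$, we can focus on the event $\{\hat\theta_n \in A_\epsilon\}$, where $A_\epsilon = \{\gamma: \|\gamma - \theta\| \le \epsilon\}$
with a given $\epsilon > 0$. For sufficiently small $\epsilon$, it can be shown (exercise) that

\[ \max_{\gamma \in A_\epsilon} \frac{\|\nabla s_n(\gamma) - \nabla s_n(\theta)\|}{n} = o_p(1), \tag{5.101} \]

using an argument similar to the proof of Lemma 5.3. From the mean-value
theorem and $s_n(\hat\theta_n) = 0$,

\[ -s_n(\theta) = \left[ \int_0^1 \nabla s_n(\theta + t(\hat\theta_n - \theta))\,dt \right] (\hat\theta_n - \theta). \]

It follows from (5.101) that

\[ \frac{1}{n} \left\| \int_0^1 \nabla s_n(\theta + t(\hat\theta_n - \theta))\,dt - \nabla s_n(\theta) \right\| = o_p(1). \]

Also, by Theorem 1.14(ii),

\[ n^{-1} \|\nabla s_n(\theta) + M_n(\theta)\| = o_p(1). \]

This and $\liminf_n \lambda_-[n^{-1}M_n(\theta)] > 0$ imply

\[ [M_n(\theta)]^{-1} s_n(\theta) = [1 + o_p(1)](\hat\theta_n - \theta). \]

The result follows if we can show that

\[ V_n^{-1/2} [M_n(\theta)]^{-1} s_n(\theta) \to_d N_k(0, I_k). \tag{5.102} \]

For any nonzero $l \in \mathcal R^k$,

\[ \frac{1}{(l^\tau V_n l)^{1+\delta/2}} \sum_{i=1}^n E\left| l^\tau [M_n(\theta)]^{-1} \psi_i(X_i,\theta) \right|^{2+\delta} \to 0, \tag{5.103} \]

since $\liminf_n \lambda_-[n^{-1}\mathrm{Var}(s_n(\theta))] > 0$ and $\sup_i E\|\psi_i(X_i,\theta)\|^{2+\delta} < \infty$ (exercise).
Applying the CLT (Theorem 1.15) with Liapounov's condition
(5.103), we obtain that

\[ l^\tau [M_n(\theta)]^{-1} s_n(\theta) \big/ \sqrt{l^\tau V_n l} \to_d N(0,1) \tag{5.104} \]

for any $l$, which implies (5.102) (exercise). ∎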
Asymptotic normality of GEE estimators can be established under various
other conditions; see, for example, Serfling (1980, Chapter 7) and He
and Shao (1996).


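If $X_i$'s are i.i.d. and $\psi_i \equiv \psi$, the asymptotic covariance matrix in (5.100)
reduces to

\[ V_n = n^{-1} \{E[\varphi(X_1,\theta)]\}^{-1} E\{\psi(X_1,\theta)[\psi(X_1,\theta)]^\tau\} \{E[\varphi(X_1,\theta)]\}^{-1}, \]

where $\varphi(x,\gamma) = \partial\psi(x,\gamma)/\partial\gamma$. When $\theta$ is univariate, $V_n$ further reduces to

\[ V_n = n^{-1} E[\psi(X_1,\theta)]^2 \big/ \{E[\varphi(X_1,\theta)]\}^2. \]

Under the conditions of Theorem 5.14,

\[ E[\varphi(X_1,\theta)] = \int \frac{\partial\psi(x,\theta)}{\partial\theta}\,dF(x) = \frac{\partial}{\partial\theta} \int \psi(x,\theta)\,dF(x). \]

Hence, the result in Theorem 5.14 coincides with that in Theorem 5.13.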


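Example 5.14. Consider the quasi-likelihood equation in (5.90) and $\psi_i$ in
(5.96). If $\sup_i \|Z_i\| < \infty$, then $\psi_i$ satisfies the conditions in Theorem 5.14
(exercise). Let $\tilde V_n(\gamma) = [D_i(\gamma)]^{1/2} \tilde R_i\, [D_i(\gamma)]^{1/2}$. Then

\[ \mathrm{Var}(s_n(\theta)) = \sum_{i=1}^n G_i(\theta) [\tilde V_n(\theta)]^{-1}\, \mathrm{Var}(X_i)\, [\tilde V_n(\theta)]^{-1} [G_i(\theta)]^\tau \]

and

\[ M_n(\theta) = \sum_{i=1}^n G_i(\theta) [\tilde V_n(\theta)]^{-1} [G_i(\theta)]^\tau. \]

If $\tilde R_i = R_i$ (the true correlation matrix) for all $i$, then

\[ \mathrm{Var}(s_n(\theta)) = \sum_{i=1}^n \phi_i\, G_i(\theta) [\tilde V_n(\theta)]^{-1} [G_i(\theta)]^\tau. \]

If, in addition, $\phi_i \equiv \phi$, then

\[ V_n = [M_n(\theta)]^{-1}\, \mathrm{Var}(s_n(\theta))\, [M_n(\theta)]^{-1} = \phi\, [M_n(\theta)]^{-1}. \]
∎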


5.5 Variance Estimation 


In statistical inference the accuracy of a point estimator is usually assessed 
by its mse or amse. If the bias or asymptotic bias of an estimator is (asymp- 
totically) negligible w.r.t. its mse or amse, then assessing the mse or amse is 
equivalent to assessing variance or asymptotic variance. Since variances and 
asymptotic variances usually depend on the unknown population, we have 
to estimate them in order to report accuracies of point estimators. Variance
estimation is an important part of statistical inference, not only for
assessing accuracy, but also for constructing inference procedures studied
in Chapters 6 and 7. See also the discussion at the end of §2.5.1.


Let $\theta$ be a parameter of interest and $\hat\theta_n$ be its estimator. Suppose that,
as the sample size $n \to \infty$,

\[ V_n^{-1/2}(\hat\theta_n - \theta) \to_d N_k(0, I_k), \tag{5.105} \]

where $V_n$ is the covariance matrix or an asymptotic covariance matrix of
$\hat\theta_n$. An essential asymptotic requirement in variance estimation is the
consistency of variance estimators according to the following definition. See
also (3.60) and Exercise 116 in §3.6.


Definition 5.4. Let $\{V_n\}$ be a sequence of $k \times k$ positive definite matrices
and $\hat V_n$ be a positive definite matrix estimator of $V_n$ for each $n$. Then $\{\hat V_n\}$
or $\hat V_n$ is said to be consistent for $V_n$ (or strongly consistent for $V_n$) if and
only if

\[ \|V_n^{-1/2}\, \hat V_n\, V_n^{-1/2} - I_k\| \to_p 0 \tag{5.106} \]

(or (5.106) holds with $\to_p$ replaced by $\to_{a.s.}$). ∎


Note that (5.106) is different from $\|\hat V_n - V_n\| \to_p 0$, because $\|V_n\| \to 0$ in
most applications. It can be shown (Exercise 93) that (5.106) holds if and
only if $l_n^\tau \hat V_n l_n / l_n^\tau V_n l_n \to_p 1$ for any sequence of nonzero vectors $\{l_n\} \subset \mathcal R^k$.
If (5.105) and (5.106) hold, then

\[ \hat V_n^{-1/2}(\hat\theta_n - \theta) \to_d N_k(0, I_k) \]

(exercise), a result useful for asymptotic inference discussed in Chapters 6
and 7.


If the unknown population is in a parametric family indexed by $\theta$, then
$V_n$ is a function of $\theta$, say $V_n = V_n(\theta)$, and it is natural to estimate $V_n(\theta)$
by $V_n(\hat\theta_n)$. Consistency of $V_n(\hat\theta_n)$ according to Definition 5.4 can usually
be directly established. Thus, variance estimation in parametric problems
is usually simple. In a nonparametric problem, $V_n$ may depend on unknown
quantities other than $\theta$ and, thus, variance estimation is much more
complex.


We introduce three commonly used variance estimation methods in this 
section, the substitution method, the jackknife, and the bootstrap. 


5.5.1 The substitution method 


Suppose that we can obtain a formula for the covariance or asymptotic
covariance matrix $V_n$ in (5.105). Then a direct method of variance estimation
is to substitute unknown quantities in the variance formula by some
estimators. To illustrate, consider the simplest case where $X_1,...,X_n$ are
i.i.d. random $d$-vectors with $E\|X_1\|^2 < \infty$, $\theta = g(\mu)$, $\mu = EX_1$, $\hat\theta_n = g(\bar X)$,
and $g$ is a function from $\mathcal R^d$ to $\mathcal R^k$. Suppose that $g$ is differentiable at $\mu$.
Then, by the CLT and Theorem 1.12(i), (5.105) holds with

\[ V_n = [\nabla g(\mu)]^\tau\, \mathrm{Var}(X_1)\, \nabla g(\mu)/n, \tag{5.107} \]

which depends on unknown quantities $\mu$ and $\mathrm{Var}(X_1)$. A substitution estimator
of $V_n$ is

\[ \hat V_n = [\nabla g(\bar X)]^\tau\, S^2\, \nabla g(\bar X)/n, \tag{5.108} \]

where

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^\tau \]

is the sample covariance matrix, an extension of the sample variance to the
multivariate $X_i$'s.

By the SLLN, $\bar X \to_{a.s.} \mu$ and $S^2 \to_{a.s.} \mathrm{Var}(X_1)$. Hence, $\hat V_n$ in (5.108)
is strongly consistent for $V_n$ in (5.107), provided that $\nabla g(\mu) \ne 0$ and $\nabla g$ is
continuous at $\mu$.

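Example 5.15. Let $Y_1,...,Y_n$ be i.i.d. random variables with finite $\mu_y =
EY_1$, $\sigma_y^2 = \mathrm{Var}(Y_1)$, $\gamma_y = EY_1^3$, and $\kappa_y = EY_1^4$. Consider the estimation
of $\theta = (\mu_y, \sigma_y^2)$. Let $\hat\theta_n = (\bar Y, \hat\sigma_y^2)$, where $\hat\sigma_y^2 = n^{-1} \sum_{i=1}^n (Y_i - \bar Y)^2$. If
$X_i = (Y_i, Y_i^2)$, then $\hat\theta_n = g(\bar X)$ with $g(x) = (x_1, x_2 - x_1^2)$. Hence, (5.105)
holds with

\[ \mathrm{Var}(X_1) = \begin{pmatrix} \sigma_y^2 & \gamma_y - \mu_y(\sigma_y^2 + \mu_y^2) \\ \gamma_y - \mu_y(\sigma_y^2 + \mu_y^2) & \kappa_y - (\sigma_y^2 + \mu_y^2)^2 \end{pmatrix} \]

and

\[ \nabla g(x) = \begin{pmatrix} 1 & -2x_1 \\ 0 & 1 \end{pmatrix}. \]

The estimator $\hat V_n$ in (5.108) is strongly consistent, since $\nabla g(x)$ is obviously
a continuous function. ∎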
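The computation in Example 5.15 can be sketched numerically as follows. The function name and the plain nested-list matrix arithmetic are illustrative choices; the code forms $X_i = (Y_i, Y_i^2)$, the sample covariance matrix $S^2$ with divisor $n-1$, the gradient matrix of $g(x) = (x_1, x_2 - x_1^2)$ evaluated at $\bar X$, and then the substitution estimator (5.108).

```python
def substitution_variance(y):
    """Substitution estimator (5.108) for theta = (mean, variance) of the Y_i's."""
    n = len(y)
    x = [(v, v * v) for v in y]                       # X_i = (Y_i, Y_i^2)
    xbar = [sum(col) / n for col in zip(*x)]          # sample mean of the X_i's
    # sample covariance matrix S^2 (divisor n - 1)
    s2 = [[sum((xi[a] - xbar[a]) * (xi[b] - xbar[b]) for xi in x) / (n - 1)
           for b in range(2)] for a in range(2)]
    # gradient matrix of g at xbar: column j holds the gradient of g_j
    grad = [[1.0, -2.0 * xbar[0]],
            [0.0, 1.0]]
    # tmp = S^2 * grad, then V = grad^T * tmp / n
    tmp = [[sum(s2[a][k] * grad[k][b] for k in range(2)) for b in range(2)]
           for a in range(2)]
    return [[sum(grad[k][a] * tmp[k][b] for k in range(2)) / n for b in range(2)]
            for a in range(2)]
```

For `y = [1, 2, 3, 4, 5]`, the (1,1) entry is the sample variance of the $Y_i$'s divided by $n$, i.e. 0.5, and the (2,2) entry equals the sample variance of $(Y_i - \bar Y)^2$ divided by $n$, i.e. 0.7.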


Similar results can be obtained for problems in Examples 3.21 and 3.23 
and Exercises 100 and 101 in §3.6. 

A key step in the previous discussion is the derivation of formula (5.107)
for the asymptotic covariance matrix of $\hat\theta_n = g(\bar X)$ via Taylor's expansion
(Theorem 1.12) and the CLT. Thus, the idea can be applied to the case
where $\hat\theta_n = T(F_n)$, a differentiable statistical functional.

We still consider i.i.d. random $d$-vectors $X_1,...,X_n$ from $F$. Suppose
that $T$ is a vector-valued functional whose components are $\varrho$-Hadamard
differentiable at $F$, where $\varrho$ is either $\varrho_\infty$ or a distance satisfying (5.42).
Let $\phi_F$ be the vector of influence functions of components of $T$. If the
components of $\phi_F$ satisfy (5.40), then (5.105) holds with $\theta = T(F)$, $\hat\theta_n =
T(F_n)$, $F_n$ = the empirical c.d.f. in (5.1), and

\[ V_n = \frac{\mathrm{Var}(\phi_F(X_1))}{n} = \frac{1}{n} \int \phi_F(x)[\phi_F(x)]^\tau dF(x). \tag{5.109} \]

Formula (5.109) leads to a natural substitution variance estimator

\[ \hat V_n = \frac{1}{n} \int \phi_{F_n}(x)[\phi_{F_n}(x)]^\tau dF_n(x) = \frac{1}{n^2} \sum_{i=1}^n \phi_{F_n}(X_i)[\phi_{F_n}(X_i)]^\tau, \tag{5.110} \]

provided that $\phi_{F_n}(x)$ is well defined, i.e., the components of $T$ are Gâteaux
differentiable at $F_n$ for sufficiently large $n$. Under some more conditions on
$\phi_{F_n}$, we can establish the consistency of $\hat V_n$ in (5.110).


Theorem 5.15. Let $X_1,...,X_n$ be i.i.d. random $d$-vectors from $F$, $T$ be
a vector-valued functional whose components are Gâteaux differentiable at
$F$ and $F_n$, and $\phi_F$ be the vector of influence functions of components of
$T$. Suppose that $\sup_{\|x\| \le c} \|\phi_{F_n}(x) - \phi_F(x)\| = o_p(1)$ for any $c > 0$ and
that there exist a constant $c_0 > 0$ and a function $h(x) \ge 0$ such that
$E[h(X_1)] < \infty$ and $P(\|\phi_{F_n}(x)\|^2 \le h(x) \text{ for all } \|x\| \ge c_0) \to 1$. Then $\hat V_n$ in
(5.110) is consistent for $V_n$ in (5.109).

Proof. Let $\zeta(x) = \phi_F(x)[\phi_F(x)]^\tau$ and $\zeta_n(x) = \phi_{F_n}(x)[\phi_{F_n}(x)]^\tau$. By the
SLLN,

\[ \frac{1}{n} \sum_{i=1}^n \zeta(X_i) \to_{a.s.} \int \zeta(x)\,dF(x). \]

Hence the result follows from

\[ \left\| \frac{1}{n} \sum_{i=1}^n [\zeta_n(X_i) - \zeta(X_i)] \right\| = o_p(1). \]

Using the assumed conditions and the argument in the proof of Lemma 5.3,
we can show that for any $\epsilon > 0$, there is a $c > 0$ such that

\[ P\left( \frac{1}{n} \sum_{i=1}^n \|\zeta_n(X_i) - \zeta(X_i)\|\, I_{(c,\infty)}(\|X_i\|) > \frac{\epsilon}{2} \right) \le \epsilon \]

and

\[ P\left( \frac{1}{n} \sum_{i=1}^n \|\zeta_n(X_i) - \zeta(X_i)\|\, I_{[0,c]}(\|X_i\|) > \frac{\epsilon}{2} \right) \le \epsilon \]

for sufficiently large $n$. This completes the proof. ∎


Example 5.16. Consider the L-functional defined in (5.46) and the L-estimator
$\hat\theta_n = T(F_n)$. Theorem 5.6 shows that $T$ is Hadamard differentiable
at $F$ under some conditions on $J$. It can be shown (exercise) that $T$ is
Gâteaux differentiable at $F_n$ with $\phi_{F_n}(x)$ given by (5.48) (with $F$ replaced
by $F_n$). Then the difference $\phi_{F_n}(x) - \phi_F(x)$ is equal to

\[ \int (F_n - F)(y)\, J(F_n(y))\,dy + \int (F - \delta_x)(y)\, [J(F_n(y)) - J(F(y))]\,dy. \]

One can show (exercise) that the conditions in Theorem 5.15 are satisfied
if the conditions in Theorem 5.6(i) or (ii) (with $E|X_1| < \infty$) hold. ∎


Substitution variance estimators for M-estimators and U-statistics can 
also be derived (exercises). 

The substitution method can clearly be applied to non-i.i.d. cases. For
example, the LSE $\hat\beta$ in linear model (3.25) with a full rank $Z$ and i.i.d. $\varepsilon_i$'s
has $\mathrm{Var}(\hat\beta) = \sigma^2 (Z^\tau Z)^{-1}$, where $\sigma^2 = \mathrm{Var}(\varepsilon_1)$. A consistent substitution
estimator of $\mathrm{Var}(\hat\beta)$ can be obtained by replacing $\sigma^2$ in the formula of $\mathrm{Var}(\hat\beta)$
by a consistent estimator of $\sigma^2$ such as $SSR/(n-p)$ (see (3.35)).


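We now consider variance estimation for the GEE estimators described
in §5.4.1. By Theorem 5.14, the asymptotic covariance matrix of the GEE
estimator $\hat\theta_n$ is given by (5.100), where

\[ \mathrm{Var}(s_n(\theta)) = \sum_{i=1}^n E\{\psi_i(X_i,\theta)[\psi_i(X_i,\theta)]^\tau\}, \]

\[ M_n(\theta) = -\sum_{i=1}^n E[\varphi_i(X_i,\theta)], \]

and $\varphi_i(x,\gamma) = \partial\psi_i(x,\gamma)/\partial\gamma$. Substituting $\theta$ by $\hat\theta_n$ and the expectations by
their empirical analogues, we obtain the substitution estimator $\hat V_n =
\hat M_n^{-1}\, \widehat{\mathrm{Var}}(s_n)\, \hat M_n^{-1}$, where

\[ \widehat{\mathrm{Var}}(s_n) = \sum_{i=1}^n \psi_i(X_i,\hat\theta_n)[\psi_i(X_i,\hat\theta_n)]^\tau \]

and

\[ \hat M_n = -\sum_{i=1}^n \varphi_i(X_i,\hat\theta_n). \]

The proof of the following result is left as an exercise.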


Theorem 5.16. Let $X_1,...,X_n$ be independent and $\{\hat\theta_n\}$ be a consistent
sequence of GEE estimators. Assume the conditions in Theorem 5.14. Suppose
further that the sequence of functions $\{h_{ij}, i = 1,2,...\}$ satisfies the
conditions in Lemma 5.3 with $\Theta$ replaced by a compact neighborhood of $\theta$,
where $h_{ij}(x,\gamma)$ is the $j$th row of $\psi_i(x,\gamma)[\psi_i(x,\gamma)]^\tau$, $j = 1,...,k$. Let $V_n$ be
given by (5.100). Then $\hat V_n = \hat M_n^{-1}\, \widehat{\mathrm{Var}}(s_n)\, \hat M_n^{-1}$ is consistent for $V_n$. ∎
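For a univariate parameter, the substitution ("sandwich") estimator of the asymptotic variance of a GEE estimator reduces to a ratio of two sums. The sketch below assumes i.i.d. scalar data and a common score function; the function name and the convention of passing the score `psi` and its derivative `dpsi` as callables are illustrative assumptions.

```python
def sandwich_variance(data, theta_hat, psi, dpsi):
    """Scalar substitution estimator: Var-hat(s_n) / M-hat_n^2.

    psi(x, g)  : estimating function psi(x, gamma)
    dpsi(x, g) : its derivative with respect to gamma
    """
    m_hat = -sum(dpsi(x, theta_hat) for x in data)          # M-hat_n
    var_hat = sum(psi(x, theta_hat) ** 2 for x in data)     # Var-hat(s_n)
    return var_hat / m_hat ** 2
```

With `psi = lambda x, g: x - g` and `dpsi = lambda x, g: -1.0`, the GEE estimator is the sample mean and the function returns $n^{-2}\sum_i (X_i - \bar X)^2$, the usual variance estimator of $\bar X$ up to the factor $(n-1)/n$. Because $\hat M_n$ enters squared, its sign convention does not affect the result.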


5.5.2 The jackknife 


Applying the substitution method requires the derivation of a formula for 
the covariance matrix or asymptotic covariance matrix of a point estimator. 
There are variance estimation methods that can be used without actually 
deriving such a formula (only the existence of the covariance matrix or 
asymptotic covariance matrix is assumed), at the expense of requiring a 
large number of computations. These methods are called resampling meth- 
ods, replication methods, or data reuse methods. The jackknife method 
introduced here and the bootstrap method in §5.5.3 are the most popular
resampling methods. 

The jackknife method was proposed by Quenouille (1949) and Tukey
(1958). Let $\hat\theta_n$ be a vector-valued estimator based on independent $X_i$'s,
where each $X_i$ is a random $d_i$-vector and $\sup_i d_i < \infty$. Let $\hat\theta_{-i}$ be the
same estimator but based on $X_1,...,X_{i-1},X_{i+1},...,X_n$, $i = 1,...,n$. Note
that $\hat\theta_{-i}$ also depends on $n$ but the subscript $n$ is omitted for simplicity.
Since $\hat\theta_n$ and $\hat\theta_{-1},...,\hat\theta_{-n}$ are estimators of the same quantity, the "sample
covariance matrix"

\[ \frac{1}{n-1} \sum_{i=1}^n (\hat\theta_{-i} - \bar\theta_n)(\hat\theta_{-i} - \bar\theta_n)^\tau \tag{5.111} \]

can be used as a measure of the variation of $\hat\theta_n$, where $\bar\theta_n$ is the average of
$\hat\theta_{-i}$'s.

There are two major differences between the quantity in (5.111) and
the sample covariance matrix $S^2$ previously discussed. First, $\hat\theta_{-i}$'s are not
independent. Second, $\hat\theta_{-i} - \hat\theta_{-j}$ usually converges to 0 at a fast rate (such
as $n^{-1}$). Hence, to estimate the asymptotic covariance matrix of $\hat\theta_n$, the
quantity in (5.111) should be multiplied by a correction factor $c_n$. If $\hat\theta_n = \bar X$
($d_i \equiv d$), then $\hat\theta_{-i} - \bar\theta_n = (n-1)^{-1}(\bar X - X_i)$ and the quantity in (5.111)
reduces to

\[ \frac{1}{(n-1)^3} \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^\tau = \frac{1}{(n-1)^2}\, S^2, \]

where $S^2$ is the sample covariance matrix. Thus, the correction factor $c_n$
is $(n-1)^2/n$ for the case of $\hat\theta_n = \bar X$ since, by the SLLN, $S^2/n$ is strongly
consistent for $\mathrm{Var}(\bar X)$.


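It turns out that the same correction factor works for many other estimators.
This leads to the following jackknife variance estimator for $\hat\theta_n$:

\[ \hat V_J = \frac{n-1}{n} \sum_{i=1}^n (\hat\theta_{-i} - \bar\theta_n)(\hat\theta_{-i} - \bar\theta_n)^\tau. \tag{5.112} \]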
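A direct implementation of (5.112) for a real-valued estimator is short. The following sketch assumes the data are held in a Python list and that the estimator is passed as a function of a list; both the interface and the names are illustrative.

```python
def sample_mean(values):
    """Sample mean of a list of numbers."""
    return sum(values) / len(values)

def jackknife_variance(data, estimator):
    """Jackknife variance estimator (5.112) for a real-valued estimator."""
    n = len(data)
    # leave-one-out estimates theta_{-i}
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(loo) / n                          # average of the theta_{-i}'s
    return (n - 1) / n * sum((t - center) ** 2 for t in loo)
```

For the sample mean, `jackknife_variance(data, sample_mean)` reproduces $S^2/n$ exactly; for example, it returns 0.5 for `data = [1, 2, 3, 4, 5]`. The cost is $n$ evaluations of the estimator, which is the computational price of not deriving a variance formula.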


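Theorem 5.17. Let $X_1,...,X_n$ be i.i.d. random $d$-vectors from $F$ with
finite $\mu = E(X_1)$ and $\mathrm{Var}(X_1)$, and let $\hat\theta_n = g(\bar X)$. Suppose that $\nabla g$ is
continuous at $\mu$ and $\nabla g(\mu) \ne 0$. Then the jackknife variance estimator $\hat V_J$
in (5.112) is strongly consistent for $V_n$ in (5.107).

Proof. We prove the case where $g$ is real-valued. The proof of the general
case is left to the reader. Let $\bar X_{-i}$ be the sample mean based on
$X_1,...,X_{i-1},X_{i+1},...,X_n$. From the mean-value theorem, we have

\[ \hat\theta_{-i} - \hat\theta_n = g(\bar X_{-i}) - g(\bar X) = [\nabla g(\xi_{n,i})]^\tau (\bar X_{-i} - \bar X) = [\nabla g(\bar X)]^\tau (\bar X_{-i} - \bar X) + R_{n,i}, \]

where $R_{n,i} = [\nabla g(\xi_{n,i}) - \nabla g(\bar X)]^\tau (\bar X_{-i} - \bar X)$ and $\xi_{n,i}$ is a point on the
line segment between $\bar X_{-i}$ and $\bar X$. From $\bar X_{-i} - \bar X = (n-1)^{-1}(\bar X - X_i)$, it
follows that $\sum_{i=1}^n (\bar X_{-i} - \bar X) = 0$ and

\[ \frac{1}{n} \sum_{i=1}^n (\hat\theta_{-i} - \hat\theta_n) = \frac{1}{n} \sum_{i=1}^n R_{n,i} = \bar R_n. \]

From the definition of the jackknife estimator in (5.112),

\[ \hat V_J = A_n + B_n + 2C_n, \]

where

\[ A_n = \frac{n-1}{n} [\nabla g(\bar X)]^\tau \sum_{i=1}^n (\bar X_{-i} - \bar X)(\bar X_{-i} - \bar X)^\tau\, \nabla g(\bar X), \]

\[ B_n = \frac{n-1}{n} \sum_{i=1}^n (R_{n,i} - \bar R_n)^2, \]

and

\[ C_n = \frac{n-1}{n} \sum_{i=1}^n (R_{n,i} - \bar R_n) [\nabla g(\bar X)]^\tau (\bar X_{-i} - \bar X). \]

By $\bar X_{-i} - \bar X = (n-1)^{-1}(\bar X - X_i)$, the SLLN, and the continuity of $\nabla g$ at
$\mu$,

\[ A_n / V_n \to_{a.s.} 1. \]

Also,

\[ (n-1) \sum_{i=1}^n \|\bar X_{-i} - \bar X\|^2 = \frac{1}{n-1} \sum_{i=1}^n \|X_i - \bar X\|^2 = O(1) \text{ a.s.} \tag{5.113} \]

Hence

\[ \max_{i \le n} \|\bar X_{-i} - \bar X\|^2 \to_{a.s.} 0, \]

which, together with the continuity of $\nabla g$ at $\mu$ and $\|\xi_{n,i} - \bar X\| \le \|\bar X_{-i} - \bar X\|$,
implies that

\[ u_n = \max_{i \le n} \|\nabla g(\xi_{n,i}) - \nabla g(\bar X)\| \to_{a.s.} 0. \]

From (5.107) and (5.113), $\sum_{i=1}^n \|\bar X_{-i} - \bar X\|^2 / V_n = O(1)$ a.s. Hence

\[ \frac{B_n}{V_n} \le \frac{n-1}{V_n n} \sum_{i=1}^n R_{n,i}^2 \le \frac{u_n^2}{V_n} \sum_{i=1}^n \|\bar X_{-i} - \bar X\|^2 \to_{a.s.} 0. \]

By the Cauchy-Schwarz inequality, $(C_n/V_n)^2 \le (A_n/V_n)(B_n/V_n) \to_{a.s.} 0$.
This proves the result. ∎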


A key step in the proof of Theorem 5.17 is that $\hat\theta_{-i} - \hat\theta_n$ can be approximated
by $[\nabla g(\bar X)]^\tau (\bar X_{-i} - \bar X)$ and the contributions of the remainders,
$R_{n,1},...,R_{n,n}$, are sufficiently small, i.e., $B_n/V_n \to_{a.s.} 0$. This indicates
that the jackknife estimator (5.112) is consistent for $\hat\theta_n$ that can be well approximated
by some linear statistic. In fact, the jackknife estimator (5.112)
has been shown to be consistent when $\hat\theta_n$ is a U-statistic (Arvesen, 1969)
or a statistical functional that is Hadamard differentiable and continuously
Gâteaux differentiable at $F$ (which includes certain types of L-estimators
and M-estimators). More details can be found in Shao and Tu (1995, Chapter 2).


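The jackknife method can be applied to non-i.i.d. problems. A detailed
discussion of the use of the jackknife method in survey problems can be
found in Shao and Tu (1995, Chapter 6). We now consider the jackknife
variance estimator for the LSE $\hat\beta$ in linear model (3.25). For simplicity,
assume that $Z$ is of full rank. Assume also that $\varepsilon_i$'s are independent with
$E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$. Then

\[ \mathrm{Var}(\hat\beta) = (Z^\tau Z)^{-1} \sum_{i=1}^n \sigma_i^2 Z_i Z_i^\tau\, (Z^\tau Z)^{-1}. \]

Let $\hat\beta_{-i}$ be the LSE of $\beta$ based on the data with the $i$th pair $(X_i, Z_i)$ deleted.
Using the fact that $(A + cc^\tau)^{-1} = A^{-1} - A^{-1}cc^\tau A^{-1}/(1 + c^\tau A^{-1}c)$ for a
matrix $A$ and a vector $c$, we can show that (exercise)

\[ \hat\beta_{-i} = \hat\beta - r_i (Z^\tau Z)^{-1} Z_i/(1 - h_{ii}), \tag{5.114} \]

where $r_i = X_i - Z_i^\tau \hat\beta$ is the $i$th residual and $h_{ii} = Z_i^\tau (Z^\tau Z)^{-1} Z_i$. Hence

\[ \hat V_J = \frac{n-1}{n} (Z^\tau Z)^{-1} \left[ \sum_{i=1}^n \frac{r_i^2 Z_i Z_i^\tau}{(1 - h_{ii})^2} - \frac{1}{n} \sum_{i=1}^n \frac{r_i Z_i}{1 - h_{ii}} \sum_{i=1}^n \frac{r_i Z_i^\tau}{1 - h_{ii}} \right] (Z^\tau Z)^{-1}. \]

Wu (1986) proposed the following weighted jackknife variance estimator
that improves $\hat V_J$:

\[ \hat V_{WJ} = \sum_{i=1}^n (1 - h_{ii})(\hat\beta_{-i} - \hat\beta)(\hat\beta_{-i} - \hat\beta)^\tau = (Z^\tau Z)^{-1} \sum_{i=1}^n \frac{r_i^2 Z_i Z_i^\tau}{1 - h_{ii}}\, (Z^\tau Z)^{-1}. \]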
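The leave-one-out identity (5.114) means that the jackknife for the LSE needs no refitting. The sketch below checks this in the simplest setting, assumed here purely for illustration: a no-intercept model with a single scalar covariate ($p = 1$), so that $Z^\tau Z = \sum_i Z_i^2$ is a scalar. The function names are hypothetical.

```python
def lse(z, x):
    """LSE for the no-intercept model X_i = beta * Z_i + eps_i (p = 1)."""
    return sum(zi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)

def loo_lse(z, x):
    """Leave-one-out LSE's computed from (5.114), without refitting."""
    szz = sum(zi * zi for zi in z)            # Z'Z, a scalar when p = 1
    beta = lse(z, x)
    out = []
    for zi, xi in zip(z, x):
        r = xi - zi * beta                    # residual r_i
        h = zi * zi / szz                     # leverage h_ii
        out.append(beta - r * zi / szz / (1.0 - h))
    return out

def weighted_jackknife(z, x):
    """Wu's weighted jackknife: sum_i (1 - h_ii) * (beta_{-i} - beta)^2."""
    szz = sum(zi * zi for zi in z)
    beta = lse(z, x)
    return sum((1.0 - zi * zi / szz) * (b - beta) ** 2
               for zi, b in zip(z, loo_lse(z, x)))
```

Comparing `loo_lse(z, x)` with estimates obtained by actually deleting each pair and refitting should give agreement up to rounding error, and `weighted_jackknife(z, x)` should agree with the closed form $\sum_i r_i^2 Z_i^2 (1-h_{ii})^{-1} / (\sum_i Z_i^2)^2$.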


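Theorem 5.18. Assume the conditions in Theorem 3.12 and that $\varepsilon_i$'s are
independent. Then both $\hat V_J$ and $\hat V_{WJ}$ are consistent for $\mathrm{Var}(\hat\beta)$.

Proof. Let $l_n \in \mathcal R^p$, $n = 1,2,...$, be nonzero vectors and $l_i = l_n^\tau (Z^\tau Z)^{-1} Z_i$.
Since $\max_{i \le n} h_{ii} \to 0$, the result for $\hat V_{WJ}$ follows from

\[ \sum_{i=1}^n l_i^2 r_i^2 \Big/ \sum_{i=1}^n l_i^2 \sigma_i^2 \to_p 1 \tag{5.115} \]

(see Exercise 93). By the WLLN (Theorem 1.14(ii)) and $\max_{i \le n} h_{ii} \to 0$,

\[ \sum_{i=1}^n l_i^2 \varepsilon_i^2 \Big/ \sum_{i=1}^n l_i^2 \sigma_i^2 \to_p 1. \]

Note that $r_i = \varepsilon_i + Z_i^\tau (\beta - \hat\beta)$ and

\[ \max_{i \le n} [Z_i^\tau (\beta - \hat\beta)]^2 \le \|Z(\hat\beta - \beta)\|^2 \max_{i \le n} h_{ii} = o_p(1). \]

Hence (5.115) holds.

The consistency of $\hat V_J$ follows from (5.115) and

\[ \frac{n-1}{n^2} \left( \sum_{i=1}^n \frac{l_i r_i}{1 - h_{ii}} \right)^2 \Big/ \sum_{i=1}^n l_i^2 \sigma_i^2 = o_p(1). \tag{5.116} \]

The proof of (5.116) is left as an exercise. ∎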


Finally, let us consider the jackknife estimators for GEE estimators in
§5.4.1. Under the conditions of Proposition 5.5 or 5.6, it can be shown that

\[ \max_{i \le n} \|\hat\theta_{-i} - \theta\| = o_p(1), \tag{5.117} \]

where $\hat\theta_{-i}$ is a root of $s_{ni}(\gamma) = 0$ and

\[ s_{ni}(\gamma) = \sum_{j \ne i,\, j \le n} \psi_j(X_j,\gamma). \]

Assume that $\psi_i(x,\gamma)$ is continuously differentiable w.r.t. $\gamma$ in a neighborhood
of $\theta$. Using Taylor's expansion and the fact that $s_{ni}(\hat\theta_{-i}) = 0$ and
$s_n(\hat\theta_n) = 0$, we obtain that

\[ \psi_i(X_i,\hat\theta_{-i}) = \left[ \int_0^1 \nabla s_n(\hat\theta_n + t(\hat\theta_{-i} - \hat\theta_n))\,dt \right] (\hat\theta_{-i} - \hat\theta_n). \]

Following the proof of Theorem 5.14, we obtain that

\[ \hat V_J = [M_n(\theta)]^{-1} \sum_{i=1}^n \psi_i(X_i,\hat\theta_{-i})[\psi_i(X_i,\hat\theta_{-i})]^\tau\, [M_n(\theta)]^{-1} + R_n, \]

where $R_n$ satisfies $\|V_n^{-1/2} R_n V_n^{-1/2}\| = o_p(1)$ for $V_n$ in (5.100). Under the
conditions of Theorem 5.16, it follows from (5.117) that $\hat V_J$ is consistent.
If $\hat\theta_n$ is computed using an iteration method, then the computation of
$\hat V_J$ requires $n$ additional iteration processes. We may use the idea of a
one-step MLE to reduce the amount of computation. For each $i$, let

\[ \hat\theta_{-i} = \hat\theta_n - [\nabla s_{ni}(\hat\theta_n)]^{-1} s_{ni}(\hat\theta_n), \tag{5.118} \]

which is the result from the first iteration when the Newton-Raphson
method is applied in computing a root of $s_{ni}(\gamma) = 0$ and $\hat\theta_n$ is used as
the initial point. Note that $\hat\theta_{-i}$'s in (5.118) satisfy (5.117) (exercise). If the
jackknife variance estimator is based on $\hat\theta_{-i}$'s in (5.118), then

\[ \hat V_J = [M_n(\theta)]^{-1} \sum_{i=1}^n \psi_i(X_i,\hat\theta_n)[\psi_i(X_i,\hat\theta_n)]^\tau\, [M_n(\theta)]^{-1} + R_n, \]

where $R_n$ satisfies $\|V_n^{-1/2} R_n V_n^{-1/2}\| = o_p(1)$. These results are summarized
in the following theorem.


Theorem 5.19. Assume the conditions in Theorems 5.14 and 5.16. As- 
sume further that 0_;’s are given by (5.118) or GEE estimators satisfying 
(5.117). Then the jackknife variance estimator V7 is consistent for V, given 
n (5.100). § 
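To illustrate how close the one-step estimates (5.118) can be to fully iterated leave-one-out roots, here is a small sketch under an assumed estimating function $\psi(x,\gamma) = x - e^\gamma$, chosen only because its roots are available in closed form ($\hat\theta_n$ is the log of the sample mean). The data and all function names are hypothetical.

```python
import math

def jackknife_var(loo):
    """(n-1)/n times the sum of squared deviations of leave-one-out estimates."""
    n = len(loo)
    m = sum(loo) / n
    return (n - 1) / n * sum((t - m) ** 2 for t in loo)

# Assumed estimating function psi(x, g) = x - exp(g); the root of
# s_n(g) = sum_i psi(x_i, g) is theta_hat = log(sample mean).
x = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9, 2.2, 3.6]
n = len(x)
theta_hat = math.log(sum(x) / n)

def s_ni(i, g):
    """Estimating function with the i-th observation deleted."""
    return sum(xj - math.exp(g) for j, xj in enumerate(x) if j != i)

def grad_s_ni(g):
    """Derivative of s_ni w.r.t. g (the same for every i in this example)."""
    return -(n - 1) * math.exp(g)

# One-step estimates (5.118): a single Newton-Raphson step from theta_hat
one_step = [theta_hat - s_ni(i, theta_hat) / grad_s_ni(theta_hat) for i in range(n)]
# Fully iterated leave-one-out roots, in closed form for this psi
exact = [math.log((sum(x) - xi) / (n - 1)) for xi in x]

v_one_step = jackknife_var(one_step)
v_exact = jackknife_var(exact)
```

In this example the two sets of leave-one-out estimates differ only by second-order terms, so `v_one_step` and `v_exact` are close while the one-step version needs no iteration.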


5.5.3 The bootstrap 


The basic idea of the bootstrap method can be described as follows. Suppose
that $P$ is a population or model that generates the sample $X$ and that
we need to estimate $\mathrm{Var}(\hat\theta)$, where $\hat\theta = \hat\theta(X)$ is an estimator, a statistic
based on $X$. Suppose further that the unknown population $P$ is estimated
by $\hat P$, based on the sample $X$. Let $X^*$ be a sample (called a bootstrap
sample) taken from the estimated population $\hat P$ using the same or a similar
sampling procedure used to obtain $X$, and let $\hat\theta^* = \hat\theta(X^*)$, which is the
same as $\hat\theta$ but with $X$ replaced by $X^*$. If we believe that $P = \hat P$ (i.e.,
we have a perfect estimate of the population), then $\mathrm{Var}(\hat\theta) = \mathrm{Var}_*(\hat\theta^*)$,
where $\mathrm{Var}_*$ is the conditional variance w.r.t. the randomness in generating
$X^*$, given $X$. In general, $P \ne \hat P$ and, therefore, $\mathrm{Var}(\hat\theta) \ne \mathrm{Var}_*(\hat\theta^*)$. But
$V_B = \mathrm{Var}_*(\hat\theta^*)$ is an empirical analogue of $\mathrm{Var}(\hat\theta)$ and can be used as an
estimate of $\mathrm{Var}(\hat\theta)$.


In a few cases, an explicit form of $V_B = \mathrm{Var}_*(\hat\theta^*)$ can be obtained.
First, consider i.i.d. $X_1,\ldots,X_n$ from a c.d.f. $F$ on $\mathcal{R}^d$. The population is
determined by $F$. Suppose that we estimate $F$ by the empirical c.d.f. $F_n$
in (5.1) and that $X_1^*,\ldots,X_n^*$ are i.i.d. from $F_n$. For $\hat\theta = \bar X$, its bootstrap
analogue is $\hat\theta^* = \bar X^*$, the average of $X_i^*$'s. Then


\[ V_B = \mathrm{Var}_*(\bar X^*) = \frac{1}{n^2}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^\tau = \frac{n-1}{n^2}S^2, \]


where $S^2$ is the sample covariance matrix. In this case $V_B = \mathrm{Var}_*(\bar X^*)$ is
a strongly consistent estimator for $\mathrm{Var}(\bar X)$. Next, consider i.i.d. random
variables $X_1,\ldots,X_n$ from a c.d.f. $F$ on $\mathcal{R}$ and $\hat\theta = F_n^{-1}(\frac12)$, the sample
median. Suppose that $n = 2l - 1$ for an integer $l$. Let $X_1^*,\ldots,X_n^*$ be i.i.d.
from $F_n$ and $\hat\theta^*$ be the sample median based on $X_1^*,\ldots,X_n^*$. Then


\[ V_B = \mathrm{Var}_*(\hat\theta^*) = \sum_{j=1}^n p_j\left(X_{(j)} - \sum_{i=1}^n p_i X_{(i)}\right)^2, \]


where $X_{(1)} \le \cdots \le X_{(n)}$ are order statistics and $p_j = P(\hat\theta^* = X_{(j)}|X)$. It
can be shown (exercise) that


\[ p_j = \sum_{t=0}^{l-1}\binom{n}{t}\frac{(j-1)^t(n-j+1)^{n-t} - j^t(n-j)^{n-t}}{n^n}. \tag{5.119} \]
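The explicit form for the sample median can be evaluated directly. The sketch below, with hypothetical function names and made-up data, computes $p_j$ from (5.119) and cross-checks the resulting variance against a brute-force enumeration of all $n^n$ equally likely bootstrap samples, which is feasible only for very small $n$.

```python
from itertools import product
from math import comb

def median_bootstrap_probs(n):
    """p_j = P(theta* = X_(j) | X), j = 1..n, from (5.119); n = 2l - 1 must be odd."""
    l = (n + 1) // 2
    return [
        sum(
            comb(n, t) * ((j - 1) ** t * (n - j + 1) ** (n - t) - j ** t * (n - j) ** (n - t))
            for t in range(l)
        ) / n ** n
        for j in range(1, n + 1)
    ]

def exact_bootstrap_var_median(x):
    """V_B for the sample median, using the order statistics and p_j."""
    xs = sorted(x)
    p = median_bootstrap_probs(len(xs))
    mean = sum(pj * xj for pj, xj in zip(p, xs))
    return sum(pj * (xj - mean) ** 2 for pj, xj in zip(p, xs))

def brute_force_var_median(x):
    """Enumerate all n^n equally likely bootstrap samples (small n only)."""
    n = len(x)
    meds = [sorted(s)[n // 2] for s in product(x, repeat=n)]
    m = sum(meds) / len(meds)
    return sum((v - m) ** 2 for v in meds) / len(meds)

data = [3.0, 1.0, 7.0, 4.0, 10.0]   # illustrative sample with n = 5 (assumed)
v_formula = exact_bootstrap_var_median(data)
v_brute = brute_force_var_median(data)
```

For $n = 3$ the formula gives $p_1 = 7/27$, which matches the direct count of resamples in which at least two of the three draws equal $X_{(1)}$.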


However, in most cases $V_B$ does not have a simple explicit form. When
$P$ is known, the Monte Carlo method described in §4.1.4 can be used to
approximate $\mathrm{Var}(\hat\theta)$. That is, we draw repeatedly new data sets from $P$ and
then use the sample covariance matrix based on the values of $\hat\theta$ computed
from new data sets as a numerical approximation to $\mathrm{Var}(\hat\theta)$. This idea
can be used to approximate $V_B$, since $\hat P$ is a known population. That is,
we can draw $m$ bootstrap data sets $X^{*1},\ldots,X^{*m}$ independently from $\hat P$
(conditioned on $X$), compute $\hat\theta^{*j} = \hat\theta(X^{*j})$, $j = 1,\ldots,m$, and approximate
$V_B$ by


\[ V_B^m = \frac{1}{m}\sum_{j=1}^m (\hat\theta^{*j} - \bar\theta^*)(\hat\theta^{*j} - \bar\theta^*)^\tau, \]


where $\bar\theta^*$ is the average of $\hat\theta^{*j}$'s. Since each $X^{*j}$ is a data set generated from
$\hat P$, $V_B^m$ is a resampling estimator. From the SLLN, as $m \to \infty$, $V_B^m \to_{a.s.}
V_B$, conditioned on $X$. Both $V_B$ and its Monte Carlo approximation $V_B^m$ are
called bootstrap variance estimators for $\hat\theta$. $V_B^m$ is more useful in practical
applications, whereas in theoretical studies, we usually focus on $V_B$.
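The Monte Carlo approximation can be sketched in a few lines for a real-valued estimator. In the following illustration the data, the seed, and the function names are assumptions; the sample mean is used because its exact bootstrap variance $(n-1)S^2/n^2$ is available for comparison.

```python
import random

def bootstrap_variance(x, estimator, m=2000, seed=0):
    """Monte Carlo bootstrap variance approximation for a real-valued estimator."""
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(m):
        resample = [x[rng.randrange(n)] for _ in range(n)]  # i.i.d. draws from F_n
        reps.append(estimator(resample))
    mean_rep = sum(reps) / m
    return sum((r - mean_rep) ** 2 for r in reps) / m

def sample_mean(v):
    return sum(v) / len(v)

x = [2.3, 0.7, 1.9, 3.4, 2.8, 1.1, 4.0, 2.2]   # illustrative data (assumed)
n = len(x)
xbar = sample_mean(x)
v_exact = sum((xi - xbar) ** 2 for xi in x) / n ** 2     # equals (n-1) S^2 / n^2
v_mc = bootstrap_variance(x, sample_mean, m=20000, seed=1)
```

With a large $m$ the value `v_mc` should be close to `v_exact`; the remaining gap is pure Monte Carlo error and shrinks as $m$ grows.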

The consistency of the bootstrap variance estimator $V_B$ is a much more
complicated problem than that of the jackknife variance estimator in §5.5.2.
Some examples can be found in Shao and Tu (1995, §3.2.2).

The bootstrap method can also be applied to estimate quantities other
than $\mathrm{Var}(\hat\theta)$. For example, let $K(t) = P(\hat\theta \le t)$ be the c.d.f. of a real-valued
estimator $\hat\theta$. From the previous discussion, a bootstrap estimator of $K(t)$
is the conditional probability $P(\hat\theta^* \le t|X)$, which can be approximated
by the Monte Carlo approximation $m^{-1}\sum_{j=1}^m I_{(-\infty,t]}(\hat\theta^{*j})$. An important
application of bootstrap distribution estimators in problems of constructing
confidence sets is studied in §7.4. Here, we study the use of a bootstrap
distribution estimator to form a consistent estimator of the asymptotic
variance of a real-valued estimator $\hat\theta$.


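Suppose that
\[ \sqrt{n}(\hat\theta - \theta) \to_d N(0,v), \tag{5.120} \]


where $v$ is unknown. Let $H_n(t)$ be the c.d.f. of $\sqrt{n}(\hat\theta - \theta)$ and
\[ \hat H_B(t) = P(\sqrt{n}(\hat\theta^* - \hat\theta) \le t\,|\,X) \tag{5.121} \]
be a bootstrap estimator of $H_n(t)$. If
\[ \hat H_B(t) - H_n(t) \to_p 0 \]
for any $t$, then, by (5.120),
\[ \hat H_B(t) - \Phi(t/\sqrt{v}) \to_p 0, \]
which implies (Exercise 112) that
\[ \hat H_B^{-1}(\alpha) \to_p \sqrt{v}\,z_\alpha \]
for any $\alpha \in (0,1)$, where $z_\alpha = \Phi^{-1}(\alpha)$. Then, for $\alpha \ne \frac12$,


\[ \hat H_B^{-1}(1-\alpha) - \hat H_B^{-1}(\alpha) \to_p \sqrt{v}\,(z_{1-\alpha} - z_\alpha). \]


Therefore, a consistent estimator of $v/n$, the asymptotic variance of $\hat\theta$, is


\[ \frac{1}{n}\left[\frac{\hat H_B^{-1}(1-\alpha) - \hat H_B^{-1}(\alpha)}{z_{1-\alpha} - z_\alpha}\right]^2. \]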
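The quantile-based estimator of $v/n$ can be approximated by replacing the bootstrap c.d.f. with the empirical distribution of Monte Carlo draws of $\sqrt{n}(\hat\theta^* - \hat\theta)$. The sketch below is one way to do this; the function names, the choice $\alpha = 0.25$, the simulated data, and the simple empirical-quantile rule are all assumptions made for illustration.

```python
import random
from statistics import NormalDist

def quantile_based_variance(x, estimator, alpha=0.25, m=4000, seed=0):
    """Estimate v/n from Monte Carlo quantiles of sqrt(n) * (theta* - theta_hat)."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = estimator(x)
    roots = sorted(
        n ** 0.5 * (estimator([x[rng.randrange(n)] for _ in range(n)]) - theta_hat)
        for _ in range(m)
    )

    def h_inv(p):
        # Empirical p-th quantile of the m Monte Carlo draws
        return roots[min(m - 1, int(p * m))]

    z = NormalDist().inv_cdf
    v_hat = ((h_inv(1 - alpha) - h_inv(alpha)) / (z(1 - alpha) - z(alpha))) ** 2
    return v_hat / n

def sample_mean(v):
    return sum(v) / len(v)

data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 2.0) for _ in range(60)]   # simulated sample (assumed)
est = quantile_based_variance(x, sample_mean)
xbar = sample_mean(x)
reference = sum((xi - xbar) ** 2 for xi in x) / len(x) ** 2  # exact V_B for the mean
```

For the sample mean, `est` can be compared with `reference`; for estimators such as the sample median, where no simple variance formula exists, the same function applies unchanged.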




The following result gives some conditions under which $\hat H_B(t) - H_n(t) \to_p 0$.
The proof of part (i) is omitted. The proof of part (ii) is given in Exercises
113-115 in §5.6.


Theorem 5.20. Suppose that $X_1,\ldots,X_n$ are i.i.d. from a c.d.f. $F$ on $\mathcal{R}^d$.
Let $\hat\theta = T(F_n)$, where $T$ is a real-valued functional, $\hat\theta^* = T(F_n^*)$, where $F_n^*$ is
the empirical c.d.f. based on a bootstrap sample $X_1^*,\ldots,X_n^*$ i.i.d. from $F_n$,
and let $\hat H_B$ be given by (5.121).

(i) If $T$ is $\varrho_\infty$-Hadamard differentiable at $F$ and (5.40) holds, then


\[ \varrho_\infty(\hat H_B, H_n) \to_p 0. \tag{5.122} \]


(ii) If $d = 1$ and $T$ is $\varrho_{L_p}$-Fréchet differentiable at $F$ ($\int\{F(t)[1 - F(t)]\}^{p/2}dt
< \infty$ if $1 \le p < 2$) and (5.40) holds, then (5.122) holds. ∎


Applications of the bootstrap method to non-i.i.d. cases can be found, 
for example, in Efron and Tibshirani (1993), Hall (1992), and Shao and Tu 
(1995). 


5.6 Exercises 


1. Let $\varrho_\infty$ be the sup-norm distance. Find an example of a sequence
$\{G_n\}$ of c.d.f.'s satisfying $G_n \to_w G$ for a c.d.f. $G$, but $\varrho_\infty(G_n, G)$
does not converge to 0.


2. Let $X_1,\ldots,X_n$ be i.i.d. random $d$-vectors with c.d.f. $F$ and $F_n$ be the
empirical c.d.f. defined by (5.1). Show that for any $t > 0$ and $\epsilon > 0$,
there is a $C_{\epsilon,d}$ such that for all $n = 1, 2,\ldots$,


\[ P\left(\sup_{m\ge n}\varrho_\infty(F_m, F) > t\right) \le \frac{C_{\epsilon,d}\,e^{-(2-\epsilon)t^2 n}}{1 - e^{-(2-\epsilon)t^2}}. \]


3. Show that $\varrho_{M_p}$ defined by (5.4) is a distance on $\mathcal{F}_p$, $p \ge 1$.
4. Show that $\|\cdot\|_{L_p}$ in (5.5) is a norm for any $p \ge 1$.


5. Let $\mathcal{F}_1$ be the collection of c.d.f.'s on $\mathcal{R}$ with finite means.
(a) Show that $\varrho_{M_1}(G_1, G_2) = \int_0^1 |G_1^{-1}(z) - G_2^{-1}(z)|dz$, where $G^{-1}(z)
= \inf\{t: G(t) \ge z\}$ for any $G \in \mathcal{F}$.
(b) Show that $\varrho_{M_1}(G_1, G_2) = \varrho_{L_1}(G_1, G_2)$.


6. Find an example of a sequence {G;} C ¥ for which 
(a) limjoo Qx0(Gj,Go) = 0 but e,,,(G;,Go) does not converge to 0; 
(b) limj oo @y,(Gj,Go) = 0 but e.(G;,Go) does not converge to 0. 




10. 


11. 


12. 


13. 


14. 


15. 


16. 




. Repeat the previous exercise with @,,, replaced by @,,. 


8. Let $X$ be a random variable having c.d.f. $F$. Show that


(a) $E|X|^2 < \infty$ implies $\int\{F(t)[1 - F(t)]\}^{p/2}dt < \infty$ for $p \in (1, 2)$;
(b) $E|X|^{2+\delta} < \infty$ with some $\delta > 0$ implies $\int\{F(t)[1 - F(t)]\}^{1/2}dt <
\infty$.


9. For any one-dimensional $G_j \in \mathcal{F}_1$, $j = 1,2$, show that $\varrho_{L_1}(G_1,G_2) \ge
|\int x\,dG_1 - \int x\,dG_2|$.


In the proof of Theorem 5.3, show that p; = c/n, i = 1,...,.n, AX = 
—(c/n)"~+ is a maximum of the function H(p1,...,Pn, A) over pi > 0, 
Ne oS a Dy ne 


Show that (5.11)-(5.13) is a solution to the problem of maximizing 
£(G) in (5.8) subject to (5.10). 


In the proof of Theorem 5.4, prove the case of m > 2. 


Show that a maximum of ¢(G) in (5.17) subject to (5.10) is given by 
(5.11) with ; defined by (5.18) and (5.19). 


In Example 5.2, show that an MELE is given by (5.11) with p;’s given 
by (5.21). 


In Example 5.3, show that 
(a) maximizing (5.22) subject to (5.23) is equivalent to maximizing 


n 


5a tae Ss 
[[a°a- a)" **%, 


i=l 


where Gi = Bak Pj t= 1, sey TL; 

(b) F given by (5.24) maximizes (5.22) subject to (5.23); (Hint: use 
part (a) and the fact that pj = q paere! — q;)-) 

(c) F given by (5.25) is the same as that in (5.24); 

(d) if 6; = 1 for all ¢ (no censoring), then F’ in (5.25) is the same as 
the empirical c.d-f. in (5.1). 


Let fn be given by (5.26). 

(a) Show that f, is a Lebesgue p.d.f. on R. 

(b) Suppose that f is continuously differentiable at t, An — 0, and 
NAn — co. Show that (5.27) holds. 

(c) Under nA3 — 0 and the conditions of (b), show that (5.28) holds. 
(d) Suppose that f is continuous on [a,b], -co <a <b <0, An 3 0, 
and n\n — oo. Show that [” fn(t)dt +p J” f(t)dt. 




17. 


18. 
19. 


20. 


21. 


22. 
23. 


24. 


25. 


26. 


Let f be given by (5.29). 

(a) Show that f is a Lebesgue p.d.f. on R. 

(b) Prove (5.30) under the condition that A, — 0, nAn — oo, and 
f is bounded and continuous at t and f[[w(t)]?dt < co. (Hint: check 
Lindeberg’s condition and apply Theorem 1.15.) 

(c) Assume that A, > 0, nAn — 00, w is bounded, and f is bounded 
and continuous on [a,b], —oo < a < b < ow. Show that if f(t)dt , 


b 
J, f(e)dt. 
Prove (5.32)-(5.34) under the conditions described in §5.1.4. 


Show that A(t) in (5.35) is a consistent estimator of K(t) in (5.34), 
assuming that 6 —, @, ¢ is a continuous function on R, (Xi, Z;)’s 
are ii.d., and ||Z;|| < c for a constant c > 0. 


Let £(0,€) be a likelihood. Show that a maximum profile likelihood 
estimator 6 of @ is an MLE if €(@), the maximum of sup, (9, €) for a 
fixed #, does not depend on 6. 


Let Xj,..., Xn be iid. from N(,07). Derive the profile likelihood 
function for 4 or o?. Discuss in each case whether the maximum 
profile likelihood estimator is the same as the MLE. 


Derive the profile empirical likelihoods in (5.36) and (5.37). 


Let X1,..., Xp bei.i.d. random variables from ac.d.f. F and let r(a) = 
P(6; = 1|X; = x), where 6; = 1 if X; is observed and 6; = 0 if X; is 
missing. Assume that 0 < 7 = f m(x)dF (x) <1. 

(a) Let Fi(a) = P(X; < x|d; = 1). Show that F and F, are the same 
if and only if m(a) = 7. 

(b) Let F’ be the c.d.f. putting mass r~! to each observed X;, where 
r is the number of observed X;’s. Show that F(x) is unbiased and 
consistent for F\(x), x € R. 

(c) When x(x) = 7, show that F(x) in part (b) is unbiased and 
consistent for F(a), « € R. When x(x) is not constant, show that 
F(a) is biased and inconsistent for F(a) for some « € R. 


Show that o-Fréchet differentiability implies e-Hadamard differentia- 
bility. 


Suppose that a functional T is Gateaux differentiable at F' with a 
continuous differential Ly in the sense that @..(A;,A) — 0 implies 
Lr(A;) > Lr(A). Show that ¢p is bounded. 


Suppose that a functional T is Gateaux differentiable at F' with a 
bounded and continuous influence function ¢r. Show that the differ- 
ential Lr is continuous in the sense described in the previous exercise. 




27. 


28. 


29. 


30. 


31. 


32. 


33. 


34. 


35. 


36. 


37. 


38. 


39. 


40. 


Al. 


42. 




Let T(G) = g(f zdG) be a functional defined on ¥;, the collection of 
one-dimensional c.d.f.’s with finite means. 

(a) Find a differentiable function g for which the functional T is not 
0x-Hadamard differentiable at F’. 

(b) Show that if g is a differentiable function, then T is @, ,-Fréchet 
differentiable at F’. (Hint: use the result in Exercise 9.) 


In Example 5.5, show that (5.43) holds. (Hint: for A = c(G; — G2), 
show that ||Allv < |el(lGillv + |Gallv) = 2Ie|.) 


In Example 5.5, show that ¢@r is continuous if F' is continuous. 
In Example 5.5, show that T is not 0..-Fréchet differentiable at F’. 
Prove Proposition 5.1 (ii). 


Suppose that T is first-order and second-order o-Hadamard differen- 
tiable at F’. Prove (5.45). 


Find an example of a second-order o-Fréchet differentiable functional 
T that is not first-order o-Hadamard differentiable. 


Prove (5.47) and that (5.40) is satisfied for an L-functional if J is 
bounded and F' has a finite variance. 


Prove (iv) and (v) of Theorem 5.6. 


Discuss which of (i)-(v) in Theorem 5.6 can be applied to each of the 
L-estimators in Example 5.6. 


Obtain explicit forms of the influence functions for L-estimators in 
Example 5.6. Discuss which of them are bounded and continuous. 


Provide an example in which the L-functional T given by (5.46) is not 
0co-Hadamard differentiable at Ff’. (Hint: consider an untrimmed J.) 


Discuss which M-functionals defined in (i)-(vi) of Example 5.7 satisfy 
the conditions of Theorem 5.7. 


In the proof of Theorem 5.7, show that Ro; — 0. 


Show that the second equality in (5.51) holds when w is Borel and 
bounded. 


Show that the functional T in (5.53) is @..-Hadamard differentiable at 
F with the differential given by (5.54). Obtain the influence function 
gor and show that it is bounded and continuous if F’ is continuous. 




43. 


4A, 


45. 


A6. 


AT. 


48. 
49. 


50. 


51. 


52. 


Show that the functional T in (5.55) is 0..-Hadamard differentiable 
at F' with the differential given by (5.56). Obtain the influence func- 
tion dr and show that it is bounded and continuous if F'(y,0oo) and 
F'(co, z) are continuous. 


Let F be a continuous c.d.f. on R. Suppose that F' is symmetric 
about @ and is strictly increasing in a neighborhood of @. Show that 
Ar(t) = 0 if and only if t = 0, where Ap(t) is defined by (5.57) with 
a strictly increasing J satisfying J(1 —t) = —J(t). 


Show ee ey A in (5.57) is differentiable at @ and /,(0) is equal to 
ae ee (x)dF (x). 


Let a be an R-estimator satisfying the conditions in Theorem 5.8. 
Show that (5.41) holds with 


a= [ verre / if J'(F(a))F’ (x)dF (2) 


Calculate the asymptotic relative efficiency of the Hodges-Lehmann 
estimator in Example 5.8 w.r.t. the sample mean based on an i.i.d. 
sample from F' when 

(a) F is the c.d-f. of N(p, 07); 

(b) F is the c.d-f. of the logistic distribution LG(p, 0); 

(c 

(d 

t 


2 


) 

) F is the c.d.f. of the double exponential distribution DE(p, 0); 

) F(a) = Fo(x — 6), where Fo() is the c.d.f. of the t-distribution 
vy with v > 3. 


Let $G$ be a c.d.f. on $\mathcal{R}$. Show that $G(x) \ge t$ if and only if $x \ge G^{-1}(t)$.


Show that (5.67) implies that 6, is strongly consistent for 6, and is 
/n-consistent for 6, if F’(@,—) and F’(@,+) exist and are positive. 


Under the condition of Theorem 5.9, show that, for p. = e725: ; 
- 2C pr 
P (sup dp — 81 >«) oe n=1,2,.... 
man b= Pe 


Prove that y,(t) in (5.69) is the Lebesgue p.d.f. of the pth sample 
quantile 6, when F has the Lebesgue p.d.f. f by 

(a) differentiating the c.d-f. of 6 in (5.68); 

(b) using result (5.66) and the result in Example 2.9. 


Let X1,...,Xn be iid. random variables from F' with a finite mean. 
Show that 6, has a finite jth moment for sufficiently large n, 7 = 
125.05. 




53. 
54. 


55. 


56. 


57. 
58. 
59. 


60. 


61. 


62. 


63. 


64. 




Prove Theorem 5.10(i). 


Suppose that a c.d.f. F has a Lebesgue p.d.f. f that is continuous 
at the pth quantile of F', p € (0,1). Using the p.d-f. in (5.69) and 
Scheffé’s theorem (Proposition 1.18), prove part (iv) of Theorem 5.10. 


Let {kn} be a sequence of integers satisfying k,/n = p + o(n—\/?) 
with p € (0,1), and let X4,...,X, be iid. random variables from a 
c.d.f. F with F’(8,) > 0. Show that 


V(X (in) — Op) +a N(0, p(1 — p)/[F’p)]”). 


In the proof of Theorem 5.11, prove (5.72), (5.75), and inequality 
(5.74). 


Prove Corollary 5.1. 
Prove the claim in Example 5.9. 


Let T(G)=G~!(p) be the pth quantile functional. Suppose that F has 
a positive derivative F” in a neighborhood of 6= F'~!(p). Show that 
T is Gateaux differentiable at F' and obtain the influence function. 


Let $X_1,\ldots,X_n$ be i.i.d. from the Cauchy distribution $C(0,1)$.
(a) Show that $E(X_{(j)})^2 < \infty$ if and only if $3 \le j \le n-2$.
(b) Show that $E(\hat\theta_{0.5})^2 < \infty$ for $n \ge 5$.


Suppose that $F$ is the c.d.f. of the uniform distribution $U(\theta - \frac12, \theta + \frac12)$,
$\theta \in \mathcal{R}$. Obtain the asymptotic relative efficiency of the sample median
w.r.t. the sample mean, based on an i.i.d. sample of size $n$ from $F$.


Suppose that F(a) = Fo(a — 0) and Fo is the c.d.f. of the Cauchy 
distribution C(0, 1) truncated at c and —c, i.e., Fo has the Lebesgue 
p.df. (1+ 2?)"'I_¢.)(x)/ [1 + 2?)~ldt. Obtain the asymptotic 
relative efficiency of the sample median w.r.t. the sample mean, based 
on an i.i.d. sample of size n from F. 


Let X1,..., Xp be iid. with the c.d.f. (l-e)@ (=#)+eD (=#), where 
€ € (0,1) is a known constant, ® is the c.d.f. of the standard normal 
distribution, D is the c.d.f. of the double exponential distribution 
D(0,1), and w € R and o > 0 are unknown parameters. Consider 
the estimation of u. Obtain the asymptotic relative efficiency of the 
sample mean w.r.t. the sample median. 


Let Xj,...,Xpn be iid. with the Lebesgue p.d.f. 2-!(1 — 6?)e9*—I#|, 
where @ € (—1,1) is unknown. 
(a) Show that the median of the distribution of X, is given by m(@) = 




65. 


66. 


67. 


68. 
69. 


70. 


71. 


72. 


73. 


74. 


795. 


(1 — 6)~*log(1 + 6) when 6 > 0 and m(@) = —m(—@) when 0 < 0. 
(b) Show that the mean of the distribution of X, is (0) = 20/(1—67). 
(c) Show that the inverse functions of m(@) and ju(@) exist. Obtain 
the asymptotic relative efficiency of m~!(rn) w.r.t. w-!(X), where rh 
is the sample median and X is the sample mean. 

(e) Is w~1(X) in (d) asymptotically efficient in estimating 0? 


Show that X, in (5.77) is the L-estimator corresponding to the J 
function given in Example 5.6(iii) with G6 = 1—- a. 


Let Xj,...,X, be i.i.d. random variables from F’, where F' is symmet- 
ric about 6. 

(a) Show that X(;) — @ and @— X(n_j41) have the same distribution. 
(b) Show that et w;X(j) has ac.d.f. symmetric about 0, if w;’s are 
constants satisfying }7j'_, w; = 1 and w; = wn—j+1 for all j. 


(c) Show that the trimmed sample mean X,q has a c.d.f. symmetric 
about 0. 


Under the conditions in one of (i)-(iii) of Theorem 5.6, show that 
(5.41) holds for T(F;,) with o} given by (5.79), if o# < oo. 


Prove (5.78) under the assumed conditions. 


For the functional T given by (5.46), show that T(F)) = 0 if F is 
symmetric about 6, J is symmetric about 3, and fo J(t)dt = 1. 


Obtain the asymptotic relative efficiency of the trimmed sample mean 
Xq w.r.t. the sample mean, based on an i.i.d. sample of size n from the 
double exponential distribution DE(0,1), where 0 € R is unknown. 


Obtain the asymptotic relative efficiency of the trimmed sample mean 
Xq w.r.t. the sample median, based on an i.i.d. sample of size n from 
the Cauchy distribution C(@,1), where 6 € R is unknown. 


Consider the a-trimmed sample mean defined in (5.77). Show that 0? 
in (5.78) is the same as o% in (5.79) with J(t) = (1—2a)~'I(q.4-a)(t), 
when F(x) = Fo(a — @) and Fo is symmetric about 0. 


For o2 in (5.78), show that 

(a) if F{(0) exists and is positive, then lim 2 = 1/(2F)(0)|?; 

(b) if o? = f[ x?dFo(x) < oo, then limg.9 02 = 0”. 

Show that if J = 1, then o% in (5.79) is equal to the variance of the 
o.d.f. F. 


Calculate of in (5.79) with J(t) = 4t — 2 and F being the double 
exponential distribution DE(6,1),@0€ R. 




76. 


77. 


78. 


79. 
80. 


81. 


82. 


83. 
84. 
85. 
86. 


87. 


88. 


89. 


90. 




Consider the simple linear model in Example 3.12 with positive t;’s. 
Derive the L-estimator of 8 defined by (5.82) with a J symmetric 
about s and compare it with the LSE of (. 


Consider the one-way ANOVA model in Example 3.13. Derive the 
L-estimator of 3 defined by (5.82) when (a) J is symmetric about 4 
and (b) J(t) = (1 — 2a)~"Iq,1~a)(t). Compare these L-estimators 
with the LSE of (. 


Show that the method of moments in §3.5.2 is a special case of the 
GEE method. 


Complete the proof of Proposition 5.4. 


In the proof of Lemma 5.3, show that the probability in (5.94) is 
bounded by e. 


In Example 5.11, show that y,’s satisfy the conditions of Lemma 5.3 
if © is compact and sup, ||Z;|| < oo. 


In the proof of Proposition 5.5, show that {A,,(7)} is equicontinuous 
on any open subset of O. 


Prove Proposition 5.6. 
Prove the claim in Example 5.12. 
Prove the claims in Example 5.13. 


For Huber’s M-estimator discussed in Example 5.13, obtain a formula 
for e(F), the asymptotic relative efficiency of 6, w.r.t. X, when F is 
given by (5.76). Show that lim;..e(F’) = oo. Find the value of 
e(F) when € = 0,0 =1, and C=1.5. 


Consider the y function in Example 5.7(ii). Show that under some 
conditions on F’, w satisfies the conditions given in Theorem 5.13(i) 
or (ii). Obtain o% in (5.98) in this case. 


In the proof of Theorem 5.14, show that 

(a) (5.101) holds; 

(b) (5.103) holds; 

(c) (5.104) implies (5.102). (Hint: use Theorem 1.9(iii).) 


Prove the claim in Example 5.14, assuming some necessary moment 
conditions. 


Derive the asymptotic distribution of the MQLE (the GEE estima- 
tor based on (5.90)), assuming that X; = (Xi1,...,Xia,), E(Xn) = 
me” /(1 +e"), Var(Xiz) = mdje™ /(1 + e™)?, and (4.57) holds with 
g(t) = log 7. 




91. 


92. 


93. 


94. 
95. 


96. 


97. 
98. 
99. 


100. 


101. 


102. 


Repeat the previous exercise under the assumption that E(Xiz) = e™, 
Var(Xiz) = die”, and (4.57) holds with g(t) = logt or g(t) = 2v‘. 


In Theorem 5.14, show that result (5.99) still holds if R; is replaced 
by an estimator R; satisfying max;<», ||R; — U;|| = op(1), where U;’s 
are correlation matrices. 


Show that (5.106) holds if and only if one of the following holds: 

(a) A —, 1 and Ay —, 1, where A_ and Ax are respectively the 
smallest and largest eigenvalues of ve ay. Vz me 

(b) IZ7Valn/l7Vnln —p 1, where {ly} is any sequence of nonzero vectors 
in R*. 


Show that (5.105) and (5.106) imply Va ‘/?(6n — 6) >a Nz(0, In). 


Suppose that X,,...,X, are independent (not necessarily identically 
distributed) random d-vectors with E(X;) = w for all i. Suppose also 
that sup; E||.X;||?+° < oo for some 6 > 0. Let pp = E(X1), 0 = g(p), 
and 6, = g(X). Show that 

(a) (5.105) holds with V,, = n~?[Vg(p)]7 S07, Var(Xi) Vg(u); 

(b) Vp, in (5.108) is consistent for V, in part (a). 


Consider the ratio estimator in Example 3.21. Derive the estimator 
V, given by (5.108) and show that V,, is consistent for the asymptotic 
variance of the ratio estimator. 


Derive a consistent variance estimator for R(t) in Example 3.23. 
Prove the claims in Example 5.16. 


Let oF, be given by (5.79) with F replaced by the empirical c.d.f. F),. 
(a) Show that of /n is the same as V, in (5.110) for an L-estimator 
with influence function ¢p. 

(b) Show directly (without using Theorem 5.15) 0% as, oj in 


(5.79), under the conditions in Theorem 5.6(i) or (ii) (with EX? < 
oo). 


Derive a consistent variance estimator for a U-statistic satisfying the 
conditions in Theorem 3.5(i). 


Derive a consistent variance estimator for Huber’s M-estimator dis- 
cussed in Example 5.13. 


Assume the conditions in Theorem 5.8. Let r € (0, 3). 
(a) Show that n"Ar(T(F,) +2-") >» Ar(T(F)). 

(b) Show that n”[Ap, (T(Fn) +77") — Ar(T(E,) +277)] Sp O. 

(c) Derive a consistent estimator of the asymptotic variance of T(F;,), 
using the results in (a) and (b). 




103. 
104. 


105. 
106. 
107. 


108. 


109. 
110. 


111. 


112. 


113. 


114. 


115. 


116. 




Prove Theorem 5.16. 


Let Xj,...,X;, be random variables and 6 = X?. ae that the 


jackknife estimator in (5.112) equals = ax 2 ye + Gait ae where 


¢c;’s are the sample central moments defined by (3.52). 


Prove Theorem 5.17 for the case where g is from R4 to R* and k > 2. 
Prove (5.114). 
In the proof of Theorem 5.18, prove (5.116). 


Show that 6_;’s in (5.118) satisfy (5.117), under the conditions of 
Theorem 5.14. 


Prove Theorem 5.19. 
Prove (5.119). 


Let Xj,...,Xn be random variables and 6 = X?. Show that the 
bootstrap variance estimator based on i.i.d. X;’s from F,, is equal to 


Vp = AX én += 4X5 +42 ce a , where ¢;’s are the sample central moments 
defined by (3. 52). 


Let $G, G_1, G_2,\ldots$, be c.d.f.'s on $\mathcal{R}$. Suppose that $\varrho_\infty(G_j, G) \to 0$ as
$j \to \infty$ and $G'(x)$ exists and is positive for all $x \in \mathcal{R}$. Show that
$G_j^{-1}(p) \to G^{-1}(p)$ for any $p \in (0,1)$.

Let $X_1,\ldots,X_n$ be i.i.d. from a c.d.f. $F$ on $\mathcal{R}^d$ with a finite $\mathrm{Var}(X_1)$.
Let $X_1^*,\ldots,X_n^*$ be i.i.d. from the empirical c.d.f. $F_n$. Show that for almost
all given sequences $X_1, X_2,\ldots$, $\sqrt{n}(\bar X^* - \bar X) \to_d N_d(0, \mathrm{Var}(X_1))$.
(Hint: verify Lindeberg's condition.)


Let $X_1,\ldots,X_n$ be i.i.d. from a c.d.f. $F$ on $\mathcal{R}^d$, $X_1^*,\ldots,X_n^*$ be i.i.d. from
the empirical c.d.f. $F_n$, and let $F_n^*$ be the empirical c.d.f. based on
$X_i^*$'s. Using DKW's inequality (Lemma 5.1), show that

(a) $\varrho_\infty(F_n^*, F) \to_{a.s.} 0$;

(b) $\varrho_\infty(F_n^*, F) = O_p(n^{-1/2})$;

(c) $\varrho_{L_p}(F_n^*, F) = O_p(n^{-1/2})$, under the condition in Theorem 5.20(ii).


Using the results from the previous two exercises, prove Theorem
5.20(ii).


Under the conditions in Theorem 5.11, establish a Bahadur's representation
for the bootstrap sample quantile $\hat\theta_p^*$.


Chapter 6 


Hypothesis Tests 


A general theory of testing hypotheses is presented in this chapter. Let $X$
be a sample from a population $P$ in $\mathcal{P}$, a family of populations. Based on
the observed $X$, we test a given hypothesis $H_0: P \in \mathcal{P}_0$ versus $H_1: P \in \mathcal{P}_1$,
where $\mathcal{P}_0$ and $\mathcal{P}_1$ are two disjoint subsets of $\mathcal{P}$ and $\mathcal{P}_0 \cup \mathcal{P}_1 = \mathcal{P}$. Notational
conventions and basic concepts (such as two types of errors, significance
levels, and sizes) given in Example 2.20 and §2.4.2 are used in this chapter.


6.1 UMP Tests 


A test for a hypothesis is a statistic $T(X)$ taking values in $[0,1]$. When
$X = x$ is observed, we reject $H_0$ with probability $T(x)$ and accept $H_0$ with
probability $1 - T(x)$. If $T(X) = 1$ or 0 a.s. $\mathcal{P}$, then $T(X)$ is a nonrandomized
test. Otherwise $T(X)$ is a randomized test. For a given test $T(X)$, the
power function of $T(X)$ is defined to be


\[ \beta_T(P) = E[T(X)], \quad P \in \mathcal{P}, \tag{6.1} \]
which is the type I error probability of $T(X)$ when $P \in \mathcal{P}_0$ and one minus
the type II error probability of $T(X)$ when $P \in \mathcal{P}_1$.


As we discussed in §2.4.2, with a sample of a fixed size, we are not able
to minimize two error probabilities simultaneously. Our approach involves
maximizing the power $\beta_T(P)$ over all $P \in \mathcal{P}_1$ (i.e., minimizing the type II
error probability) and over all tests $T$ satisfying


\[ \sup_{P \in \mathcal{P}_0} \beta_T(P) \le \alpha, \tag{6.2} \]


where $\alpha \in [0,1]$ is a given level of significance. Recall that the left-hand
side of (6.2) is defined to be the size of $T$.




Definition 6.1. A test $T_*$ of size $\alpha$ is a uniformly most powerful (UMP)
test if and only if $\beta_{T_*}(P) \ge \beta_T(P)$ for all $P \in \mathcal{P}_1$ and $T$ of level $\alpha$. ∎


If U(X) is a sufficient statistic for P € P, then for any test T(X), 
E(T|U) has the same power function as T and, therefore, to find a UMP 
test we may consider tests that are functions of U only. 


The existence and characteristics of UMP tests are studied in this sec- 
tion. 


6.1.1 The Neyman-Pearson lemma 


A hypothesis $H_0$ (or $H_1$) is said to be simple if and only if $\mathcal{P}_0$ (or $\mathcal{P}_1$)
contains exactly one population. The following useful result, which has
already been used once in the proof of Theorem 4.16, provides the form of
UMP tests when both $H_0$ and $H_1$ are simple.


Theorem 6.1 (Neyman-Pearson lemma). Suppose that $\mathcal{P}_0 = \{P_0\}$ and
$\mathcal{P}_1 = \{P_1\}$. Let $f_j$ be the p.d.f. of $P_j$ w.r.t. a σ-finite measure $\nu$ (e.g.,
$\nu = P_0 + P_1$), $j = 0,1$.

(i) (Existence of a UMP test). For every $\alpha$, there exists a UMP test of size
$\alpha$, which is equal to


\[ T_*(X) = \begin{cases} 1 & f_1(X) > cf_0(X) \\ \gamma & f_1(X) = cf_0(X) \\ 0 & f_1(X) < cf_0(X), \end{cases} \tag{6.3} \]
where $\gamma \in [0,1]$ and $c \ge 0$ are some constants chosen so that $E[T_*(X)] = \alpha$
when $P = P_0$ ($c = \infty$ is allowed).
(ii) (Uniqueness). If $T_{**}$ is a UMP test of size $\alpha$, then


\[ T_{**}(X) = \begin{cases} 1 & f_1(X) > cf_0(X) \\ 0 & f_1(X) < cf_0(X) \end{cases} \quad \text{a.s. } \mathcal{P}. \tag{6.4} \]

Proof. The proof for the case of $\alpha = 0$ or 1 is left as an exercise. Assume
now that $0 < \alpha < 1$.

(i) We first show that there exist $\gamma$ and $c$ such that $E_0[T_*(X)] = \alpha$, where
$E_j$ is the expectation w.r.t. $P_j$. Let $\gamma(t) = P_0(f_1(X) > tf_0(X))$. Then
$\gamma(t)$ is nonincreasing, $\gamma(0) = 1$, and $\gamma(\infty) = 0$ (why?). Thus, there exists a
$c \in (0,\infty)$ such that $\gamma(c) \le \alpha \le \gamma(c-)$. Set


\[ \gamma = \begin{cases} \dfrac{\alpha - \gamma(c)}{\gamma(c-) - \gamma(c)} & \gamma(c-) \ne \gamma(c) \\ 0 & \gamma(c-) = \gamma(c). \end{cases} \]

Note that $\gamma(c-) - \gamma(c) = P_0(f_1(X) = cf_0(X))$. Then
\[ E_0[T_*(X)] = P_0(f_1(X) > cf_0(X)) + \gamma P_0(f_1(X) = cf_0(X)) = \alpha. \]


Next, we show that $T_*$ in (6.3) is a UMP test. Suppose that $T(X)$ is a
test satisfying $E_0[T(X)] \le \alpha$. If $T_*(x) - T(x) > 0$, then $T_*(x) > 0$ and,
therefore, $f_1(x) \ge cf_0(x)$. If $T_*(x) - T(x) < 0$, then $T_*(x) < 1$ and,
therefore, $f_1(x) \le cf_0(x)$. In any case, $[T_*(x) - T(x)][f_1(x) - cf_0(x)] \ge 0$
and, therefore,


\[ \int [T_*(x) - T(x)][f_1(x) - cf_0(x)]d\nu \ge 0, \]

i.e.,

\[ \int [T_*(x) - T(x)]f_1(x)d\nu \ge c\int [T_*(x) - T(x)]f_0(x)d\nu. \tag{6.5} \]


The left-hand side of (6.5) is $E_1[T_*(X)] - E_1[T(X)]$ and the right-hand side
of (6.5) is $c\{E_0[T_*(X)] - E_0[T(X)]\} = c\{\alpha - E_0[T(X)]\} \ge 0$. This proves
the result in (i).

(ii) Let $T_{**}(X)$ be a UMP test of size $\alpha$. Define


\[ A = \{x: T_*(x) \ne T_{**}(x),\ f_1(x) \ne cf_0(x)\}. \]


Then $[T_*(x) - T_{**}(x)][f_1(x) - cf_0(x)] > 0$ when $x \in A$ and $= 0$ when $x \in A^c$,
and


\[ \int [T_*(x) - T_{**}(x)][f_1(x) - cf_0(x)]d\nu = 0, \]


since both $T_*$ and $T_{**}$ are UMP tests of size $\alpha$. By Proposition 1.6(ii),
$\nu(A) = 0$. This proves (6.4). ∎


Theorem 6.1 shows that when both $H_0$ and $H_1$ are simple, there exists
a UMP test that can be determined by (6.4) uniquely (a.s. $\mathcal{P}$) except on
the set $B = \{x: f_1(x) = cf_0(x)\}$. If $\nu(B) = 0$, then we have a unique
nonrandomized UMP test; otherwise UMP tests are randomized on the set
$B$ and the randomization is necessary for UMP tests to have the given size
$\alpha$; furthermore, we can always choose a UMP test that is constant on $B$.
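When the sample space is finite, the constants $c$ and $\gamma$ in (6.3) can be found by scanning the attained likelihood ratios from the largest down. The sketch below uses exact rational arithmetic and two made-up p.m.f.'s; the function names are hypothetical, and the code assumes $f_0 > 0$ everywhere and $0 < \alpha < 1$.

```python
from fractions import Fraction as Fr

def neyman_pearson(f0, f1, alpha):
    """Return (c, gamma) of the test (6.3) for p.m.f.'s f0, f1 on a finite set.

    Assumes f0[x] > 0 for every x and 0 < alpha < 1.
    """
    ratios = sorted({f1[x] / f0[x] for x in f0}, reverse=True)
    for c in ratios:
        above = sum(f0[x] for x in f0 if f1[x] > c * f0[x])   # gamma(c)
        at = sum(f0[x] for x in f0 if f1[x] == c * f0[x])     # gamma(c-) - gamma(c)
        if above <= alpha <= above + at:
            return c, (alpha - above) / at
    raise ValueError("no threshold found; check the assumptions on f0 and alpha")

def np_test_function(f0, f1, c, gamma):
    """Rejection probabilities T_*(x) as a dictionary."""
    return {
        x: Fr(1) if f1[x] > c * f0[x] else gamma if f1[x] == c * f0[x] else Fr(0)
        for x in f0
    }

# Illustrative p.m.f.'s on {0, 1, 2, 3} (assumed, not from the text)
f0 = {0: Fr(1, 2), 1: Fr(1, 4), 2: Fr(1, 8), 3: Fr(1, 8)}
f1 = {0: Fr(1, 8), 1: Fr(1, 8), 2: Fr(1, 4), 3: Fr(1, 2)}
alpha = Fr(1, 5)

c, gamma = neyman_pearson(f0, f1, alpha)
T = np_test_function(f0, f1, c, gamma)
size = sum(T[x] * f0[x] for x in f0)     # E_0[T_*(X)]
power = sum(T[x] * f1[x] for x in f0)    # E_1[T_*(X)]
```

Here the set $B$ is the single point $x = 2$, and randomization on $B$ is what brings the size to exactly $\alpha$.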


Example 6.1. Suppose that $X$ is a sample of size 1, $\mathcal{P}_0 = \{P_0\}$, and $\mathcal{P}_1 =
\{P_1\}$, where $P_0$ is $N(0,1)$ and $P_1$ is the double exponential distribution
$DE(0,2)$ with the p.d.f. $4^{-1}e^{-|x|/2}$. Since $P(f_1(X) = cf_0(X)) = 0$, there is
a unique nonrandomized UMP test. From (6.3), the UMP test $T_*(x) = 1$
if and only if $\frac{\pi}{8}e^{x^2 - |x|} > c^2$ for some $c > 0$, which is equivalent to $|x| > t$
or $|x| < 1 - t$ for some $t > \frac12$. Suppose that $\alpha < P_0(|X| > 1) = 0.3173$. To
determine $t$, we use


\[ \alpha = E_0[T_*(X)] = P_0(|X| > t) + P_0(|X| < 1 - t). \tag{6.6} \]


If $t \le 1$, then $P_0(|X| > t) \ge P_0(|X| > 1) = 0.3173 > \alpha$. Hence $t$ should be
larger than 1 and (6.6) becomes


\[ \alpha = P_0(|X| > t) = \Phi(-t) + 1 - \Phi(t). \]


Thus, $t = \Phi^{-1}(1 - \alpha/2)$ and $T_*(X) = I_{(t,\infty)}(|X|)$. Note that it is not
necessary to find out what $c$ is.

Intuitively, the reason why the UMP test in this example rejects $H_0$
when $|X|$ is large is that the probability of getting a large $|X|$ is much
higher under $H_1$ (i.e., $P$ is the double exponential distribution $DE(0,2)$).

The power of $T_*$ when $P \in \mathcal{P}_1$ is


\[ E_1[T_*(X)] = P_1(|X| > t) = 1 - \frac14\int_{-t}^{t} e^{-|x|/2}dx = e^{-t/2}. \quad ∎ \]
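The critical value and power in Example 6.1 are easy to evaluate numerically. A minimal sketch using the standard library's normal distribution follows; the function name and the choice $\alpha = 0.05$ are illustrative.

```python
from math import exp
from statistics import NormalDist

def ump_normal_vs_double_exp(alpha):
    """Critical value t, size, and power of the UMP test in Example 6.1.

    Valid for alpha below P_0(|X| > 1), which is about 0.3173, so that t > 1.
    """
    std = NormalDist()                   # N(0, 1), the null distribution
    t = std.inv_cdf(1 - alpha / 2)       # t = Phi^{-1}(1 - alpha/2)
    size = 2 * (1 - std.cdf(t))          # P_0(|X| > t)
    power = exp(-t / 2)                  # P_1(|X| > t) under DE(0, 2)
    return t, size, power

t, size, power = ump_normal_vs_double_exp(0.05)
```

For $\alpha = 0.05$ the critical value is the familiar 1.96 and the power is $e^{-t/2}$, a little above one third, which shows how hard it is to separate the two distributions with a single observation.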


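Example 6.2. Let $X_1,\ldots,X_n$ be i.i.d. binary random variables with $p =
P(X_1 = 1)$. Suppose that $H_0: p = p_0$ and $H_1: p = p_1$, where $0 < p_0 <
p_1 < 1$. By Theorem 6.1, a UMP test of size $\alpha$ is


\[ T_*(Y) = \begin{cases} 1 & \lambda(Y) > c \\ \gamma & \lambda(Y) = c \\ 0 & \lambda(Y) < c, \end{cases} \]


where $Y = \sum_{i=1}^n X_i$ and


\[ \lambda(Y) = \left(\frac{p_1}{p_0}\right)^{Y}\left(\frac{1-p_1}{1-p_0}\right)^{n-Y}. \]

Since $\lambda(Y)$ is increasing in $Y$, there is an integer $m > 0$ such that


\[ T_*(Y) = \begin{cases} 1 & Y > m \\ \gamma & Y = m \\ 0 & Y < m, \end{cases} \]


where $m$ and $\gamma$ satisfy $\alpha = E_0[T_*(Y)] = P_0(Y > m) + \gamma P_0(Y = m)$. Since
$Y$ has the binomial distribution $Bi(p,n)$, we can determine $m$ and $\gamma$ from


\[ \alpha = \sum_{j=m+1}^n \binom{n}{j}p_0^j(1-p_0)^{n-j} + \gamma\binom{n}{m}p_0^m(1-p_0)^{n-m}. \tag{6.7} \]


Unless


\[ \alpha = \sum_{j=m+1}^n \binom{n}{j}p_0^j(1-p_0)^{n-j} \]

for some integer $m$, in which case we can choose $\gamma = 0$, the UMP test $T_*$ is
a randomized test.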
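Equation (6.7) can be solved by accumulating the upper binomial tail under $p_0$ until it would exceed $\alpha$. The sketch below does this with exact fractions; the function names and the numerical values of $n$, $p_0$, $p_1$, and $\alpha$ are assumptions chosen for illustration.

```python
from fractions import Fraction as Fr
from math import comb

def binom_pmf(j, n, p):
    """P(Y = j) for Y with the binomial distribution Bi(p, n)."""
    return comb(n, j) * p ** j * (1 - p) ** (n - j)

def ump_binomial(n, p0, alpha):
    """Return (m, gamma) solving (6.7); assumes 0 < alpha < 1."""
    tail = Fr(0)                          # P_0(Y > m), starting from m = n
    for m in range(n, -1, -1):
        pm = binom_pmf(m, n, p0)
        if tail + pm > alpha:
            return m, (alpha - tail) / pm
        tail += pm
    raise ValueError("alpha must be strictly between 0 and 1")

n, p0, p1, alpha = 5, Fr(1, 2), Fr(3, 4), Fr(1, 10)   # illustrative values
m, gamma = ump_binomial(n, p0, alpha)

def rejection_prob(p):
    """E[T_*(Y)] when the success probability is p."""
    upper = sum(binom_pmf(j, n, p) for j in range(m + 1, n + 1))
    return upper + gamma * binom_pmf(m, n, p)

size = rejection_prob(p0)
power = rejection_prob(p1)
```

Note that `ump_binomial` never uses $p_1$: the pair $(m, \gamma)$ depends only on $n$, $p_0$, and $\alpha$, while $p_1$ enters only through the power.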


An interesting phenomenon in Example 6.2 is that the UMP test $T_*$
does not depend on $p_1$. In such a case, $T_*$ is in fact a UMP test for testing
$H_0: p = p_0$ versus $H_1: p > p_0$.




Lemma 6.1. Suppose that there is a test $T_*$ of size $\alpha$ such that for every
$P_1 \in \mathcal{P}_1$, $T_*$ is UMP for testing $H_0$ versus the hypothesis $P = P_1$. Then $T_*$
is UMP for testing $H_0$ versus $H_1$.

Proof. For any test $T$ of level $\alpha$, $T$ is also of level $\alpha$ for testing $H_0$ versus
the hypothesis $P = P_1$ with any $P_1 \in \mathcal{P}_1$. Hence $\beta_{T_*}(P_1) \ge \beta_T(P_1)$. ∎


We conclude this section with the following generalized Neyman-Pearson
lemma. Its proof is left to the reader. Other extensions of the Neyman-Pearson
lemma can be found in Exercises 8 and 9 in §6.6.


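Proposition 6.1. Let $f_1, ..., f_{m+1}$ be Borel functions on $\mathcal{R}^p$ that are integrable w.r.t. a $\sigma$-finite measure $\nu$. For given constants $t_1, ..., t_m$, let $\mathcal{T}$ be
the class of Borel functions $\phi$ (from $\mathcal{R}^p$ to $[0, 1]$) satisfying

$$\int \phi f_i \, d\nu \le t_i, \quad i = 1, ..., m, \tag{6.8}$$

and $\mathcal{T}_0$ be the set of $\phi$'s in $\mathcal{T}$ satisfying (6.8) with all inequalities replaced
by equalities. If there are constants $c_1, ..., c_m$ such that

$$\phi_*(x) = \begin{cases} 1 & f_{m+1}(x) > c_1 f_1(x) + \cdots + c_m f_m(x) \\ 0 & f_{m+1}(x) < c_1 f_1(x) + \cdots + c_m f_m(x) \end{cases} \tag{6.9}$$

is a member of $\mathcal{T}_0$, then $\phi_*$ maximizes $\int \phi f_{m+1} \, d\nu$ over $\phi \in \mathcal{T}_0$. If $c_i \ge 0$ for
all $i$, then $\phi_*$ maximizes $\int \phi f_{m+1} \, d\nu$ over $\phi \in \mathcal{T}$.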


The existence of constants $c_i$'s in (6.9) is considered in the following
lemma whose proof can be found in Lehmann (1986, pp. 97-99).

Lemma 6.2. Let $f_1, ..., f_m$ and $\nu$ be given by Proposition 6.1. Then the
set $M = \{(\int \phi f_1 \, d\nu, ..., \int \phi f_m \, d\nu) : \phi \text{ is from } \mathcal{R}^p \text{ to } [0, 1]\}$ is convex and
closed. If $(t_1, ..., t_m)$ is an interior point of $M$, then there exist constants
$c_1, ..., c_m$ such that the function defined by (6.9) is in $\mathcal{T}_0$.


6.1.2 Monotone likelihood ratio 


The case where both $H_0$ and $H_1$ are simple is mainly of theoretical interest. If a hypothesis is not simple, it is called composite. As we discussed
in §6.1.1, UMP tests for composite $H_1$ exist in the problem discussed in
Example 6.2. We now extend this result to a class of parametric problems
in which the likelihood functions have a special property.


Definition 6.2. Suppose that the distribution of $X$ is in $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$,
a parametric family indexed by a real-valued $\theta$, and that $\mathcal{P}$ is dominated
by a $\sigma$-finite measure $\nu$. Let $f_\theta = dP_\theta/d\nu$. The family $\mathcal{P}$ is said to have
monotone likelihood ratio in $Y(X)$ (a real-valued statistic) if and only if, for
any $\theta_1 < \theta_2$, $f_{\theta_2}(x)/f_{\theta_1}(x)$ is a nondecreasing function of $Y(x)$ for values $x$
at which at least one of $f_{\theta_1}(x)$ and $f_{\theta_2}(x)$ is positive.


The following lemma states a useful result for a family with monotone 
likelihood ratio. 


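Lemma 6.3. Suppose that the distribution of $X$ is in a parametric family
$\mathcal{P}$ indexed by a real-valued $\theta$ and that $\mathcal{P}$ has monotone likelihood ratio in
$Y(X)$. If $\psi$ is a nondecreasing function of $Y$, then $g(\theta) = E[\psi(Y)]$ is a
nondecreasing function of $\theta$.

Proof. Let $\theta_1 < \theta_2$, $A = \{x : f_{\theta_1}(x) > f_{\theta_2}(x)\}$, $a = \sup_{x \in A} \psi(Y(x))$,
$B = \{x : f_{\theta_1}(x) < f_{\theta_2}(x)\}$, and $b = \inf_{x \in B} \psi(Y(x))$. Since $\mathcal{P}$ has monotone
likelihood ratio in $Y(X)$ and $\psi$ is nondecreasing in $Y$, $b \ge a$. Then the
result follows from

$$\begin{aligned}
g(\theta_2) - g(\theta_1) &= \int \psi(Y(x)) (f_{\theta_2} - f_{\theta_1})(x) \, d\nu \\
&\ge a \int_A (f_{\theta_2} - f_{\theta_1})(x) \, d\nu + b \int_B (f_{\theta_2} - f_{\theta_1})(x) \, d\nu \\
&= (b - a) \int_B (f_{\theta_2} - f_{\theta_1})(x) \, d\nu \\
&\ge 0,
\end{aligned}$$

where the equality uses $\int_A (f_{\theta_2} - f_{\theta_1}) \, d\nu = -\int_B (f_{\theta_2} - f_{\theta_1}) \, d\nu$, since both densities integrate to 1.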
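Lemma 6.3 can be checked numerically in a simple case. The sketch below uses the binomial family $Bi(p, n)$, which has monotone likelihood ratio in $Y$, together with an arbitrarily chosen nondecreasing $\psi$; the helper name `expected_psi` and the particular $\psi$ are assumptions made only for illustration.

```python
from math import comb


def expected_psi(psi, n, p):
    """E_p[psi(Y)] for Y ~ Bi(p, n), computed by direct summation."""
    return sum(psi(y) * comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1))


def psi(y):
    # An arbitrary nondecreasing function of y, chosen for illustration.
    return min(y, 3) ** 2


grid = [k / 20 for k in range(1, 20)]          # p = 0.05, 0.10, ..., 0.95
values = [expected_psi(psi, 8, p) for p in grid]
is_nondecreasing = all(a <= b + 1e-12 for a, b in zip(values, values[1:]))
print(is_nondecreasing)
```

The computed values of $g(p) = E_p[\psi(Y)]$ increase along the grid, as the lemma predicts.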


Before discussing UMP tests in families with monotone likelihood ratio, 
let us consider some examples of such families. 


Example 6.3. Let $\theta$ be real-valued and $\eta(\theta)$ be a nondecreasing function
of $\theta$. Then the one-parameter exponential family with

$$f_\theta(x) = \exp\{\eta(\theta) Y(x) - \xi(\theta)\} h(x) \tag{6.10}$$

has monotone likelihood ratio in $Y(X)$. From Tables 1.1-1.2 (§1.3.1), this
includes the binomial family $\{Bi(\theta, r)\}$, the Poisson family $\{P(\theta)\}$, the negative binomial family $\{NB(\theta, r)\}$, the log-distribution family $\{L(\theta)\}$, the
normal family $\{N(\theta, c^2)\}$ or $\{N(c, \theta)\}$, the exponential family $\{E(c, \theta)\}$, the
gamma family $\{\Gamma(\theta, c)\}$ or $\{\Gamma(c, \theta)\}$, the beta family $\{B(\theta, c)\}$ or $\{B(c, \theta)\}$,
and the double exponential family $\{DE(c, \theta)\}$, where $r$ or $c$ is known.


Example 6.4. Let $X_1, ..., X_n$ be i.i.d. from the uniform distribution on
$(0, \theta)$, where $\theta > 0$. The Lebesgue p.d.f. of $X = (X_1, ..., X_n)$ is $f_\theta(x) = \theta^{-n} I_{(0,\theta)}(x_{(n)})$, where $x_{(n)}$ is the value of the largest order statistic $X_{(n)}$.
For $\theta_1 < \theta_2$,

$$\frac{f_{\theta_2}(x)}{f_{\theta_1}(x)} = \frac{\theta_1^n}{\theta_2^n} \, \frac{I_{(0,\theta_2)}(x_{(n)})}{I_{(0,\theta_1)}(x_{(n)})},$$

which is a nondecreasing function of $x_{(n)}$ for $x$'s at which at least one of
$f_{\theta_1}(x)$ and $f_{\theta_2}(x)$ is positive, i.e., $x_{(n)} < \theta_2$. Hence the family of distributions of $X$ has monotone likelihood ratio in $X_{(n)}$.


Example 6.5. The following families have monotone likelihood ratio:
(a) the double exponential distribution family $\{DE(\theta, c)\}$ with a known $c$;
(b) the exponential distribution family $\{E(\theta, c)\}$ with a known $c$;
(c) the logistic distribution family $\{LG(\theta, c)\}$ with a known $c$;
(d) the uniform distribution family $\{U(\theta, \theta + 1)\}$;
(e) the hypergeometric distribution family $\{HG(r, \theta, N - \theta)\}$ with known
$r$ and $N$ (Table 1.1, page 18).

An example of a family that does not have monotone likelihood ratio is
the Cauchy distribution family $\{C(\theta, c)\}$ with a known $c$.


Hypotheses of the form $H_0: \theta \le \theta_0$ (or $H_0: \theta \ge \theta_0$) versus $H_1: \theta > \theta_0$
(or $H_1: \theta < \theta_0$) are called one-sided hypotheses for any given constant
$\theta_0$. The following result provides UMP tests for testing one-sided hypotheses when the distribution of $X$ is in a parametric family with monotone
likelihood ratio.


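Theorem 6.2. Suppose that $X$ has a distribution in $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$
($\Theta \subset \mathcal{R}$) that has monotone likelihood ratio in $Y(X)$. Consider the problem
of testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, where $\theta_0$ is a given constant.

(i) There exists a UMP test of size $\alpha$, which is given by

$$T_*(X) = \begin{cases} 1 & Y(X) > c \\ \gamma & Y(X) = c \\ 0 & Y(X) < c, \end{cases} \tag{6.11}$$

where $c$ and $\gamma$ are determined by $\beta_{T_*}(\theta_0) = \alpha$, and $\beta_T(\theta) = E[T(X)]$ is the
power function of a test $T$.

(ii) $\beta_{T_*}(\theta)$ is strictly increasing for all $\theta$'s for which $0 < \beta_{T_*}(\theta) < 1$.

(iii) For any $\theta < \theta_0$, $T_*$ minimizes $\beta_T(\theta)$ (the type I error probability of $T$)
among all tests $T$ satisfying $\beta_T(\theta_0) = \alpha$.

(iv) Assume that $P_\theta(f_\theta(X) = c f_{\theta_0}(X)) = 0$ for any $\theta > \theta_0$ and $c > 0$, where
$f_\theta$ is the p.d.f. of $P_\theta$. If $T$ is a test with $\beta_T(\theta_0) = \beta_{T_*}(\theta_0)$, then for any
$\theta > \theta_0$, either $\beta_T(\theta) < \beta_{T_*}(\theta)$ or $T = T_*$ a.s. $P_\theta$.

(v) For any fixed $\theta_1$, $T_*$ is UMP for testing $H_0: \theta \le \theta_1$ versus $H_1: \theta > \theta_1$,
with size $\beta_{T_*}(\theta_1)$.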

Proof. (i) Consider the hypotheses $\theta = \theta_0$ versus $\theta = \theta_1$ with any $\theta_1 > \theta_0$.
From Theorem 6.1, a UMP test is given by (6.3) with $f_j$ = the p.d.f. of $P_{\theta_j}$,
$j = 0, 1$. Since $\mathcal{P}$ has monotone likelihood ratio in $Y(X)$, this UMP test
can be chosen to be the same as $T_*$ in (6.11) with possibly different $c$ and
$\gamma$ satisfying $\beta_{T_*}(\theta_0) = \alpha$. Since $T_*$ does not depend on $\theta_1$, it follows from
Lemma 6.1 that $T_*$ is UMP for testing the hypothesis $\theta = \theta_0$ versus $H_1$.

Note that if $T_*$ is UMP for testing $\theta = \theta_0$ versus $H_1$, then it is UMP for
testing $H_0$ versus $H_1$, provided that $\beta_{T_*}(\theta) \le \alpha$ for all $\theta \le \theta_0$, i.e., the size
of $T_*$ is $\alpha$. But this follows from Lemma 6.3, i.e., $\beta_{T_*}(\theta)$ is nondecreasing
in $\theta$. This proves (i).

(ii) See Exercise 2 in §6.6. 

(iii) The result can be proved using Theorem 6.1 with all inequalities re- 
versed. 

(iv) The proof for (iv) is left as an exercise. 

(v) The proof for (v) is similar to that of (i).

By reversing inequalities throughout, we can obtain UMP tests for testing $H_0: \theta \ge \theta_0$ versus $H_1: \theta < \theta_0$.

A major application of Theorem 6.2 is to problems with one-parameter 
exponential families. 


Corollary 6.1. Suppose that $X$ has the p.d.f. given by (6.10) w.r.t. a
$\sigma$-finite measure, where $\eta$ is a strictly monotone function of $\theta$. If $\eta$ is
increasing, then $T_*$ given by (6.11) is UMP for testing $H_0: \theta \le \theta_0$ versus
$H_1: \theta > \theta_0$, where $\gamma$ and $c$ are determined by $\beta_{T_*}(\theta_0) = \alpha$. If $\eta$ is decreasing
or $H_0: \theta \ge \theta_0$ ($H_1: \theta < \theta_0$), the result is still valid by reversing inequalities
in (6.11).


Example 6.6. Let $X_1, ..., X_n$ be i.i.d. from the $N(\mu, \sigma^2)$ distribution with
an unknown $\mu \in \mathcal{R}$ and a known $\sigma^2$. Consider $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$, where $\mu_0$ is a fixed constant. The p.d.f. of $X = (X_1, ..., X_n)$ is of
the form (6.10) with $Y(X) = \bar{X}$ and $\eta(\mu) = n\mu/\sigma^2$. By Corollary 6.1 and
the fact that $\bar{X}$ is $N(\mu, \sigma^2/n)$, the UMP test is $T_*(X) = I_{(c_\alpha, \infty)}(\bar{X})$, where
$c_\alpha = \sigma z_{1-\alpha}/\sqrt{n} + \mu_0$ and $z_a = \Phi^{-1}(a)$ (see also Example 2.28).
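The cutoff $c_\alpha = \sigma z_{1-\alpha}/\sqrt{n} + \mu_0$ and the resulting power function are easy to evaluate. The following sketch is illustrative only; the function name `normal_mean_ump` is an assumption of this example and not notation from the text.

```python
from math import sqrt
from statistics import NormalDist


def normal_mean_ump(mu0, sigma, n, alpha):
    """Cutoff and power function of the test that rejects when Xbar > c_alpha."""
    z = NormalDist().inv_cdf(1 - alpha)        # z_{1-alpha}
    c_alpha = sigma * z / sqrt(n) + mu0

    def power(mu):
        # P_mu(Xbar > c_alpha), where Xbar ~ N(mu, sigma^2 / n)
        return 1 - NormalDist(mu, sigma / sqrt(n)).cdf(c_alpha)

    return c_alpha, power


c_alpha, power = normal_mean_ump(mu0=0.0, sigma=1.0, n=25, alpha=0.05)
print(round(c_alpha, 4), round(power(0.0), 4), round(power(0.5), 4))
```

The power equals $\alpha$ at $\mu = \mu_0$ and increases with $\mu$, in line with Theorem 6.2(ii).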


To derive a UMP test for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ when
$X$ has the p.d.f. (6.10), it is essential to know the distribution of $Y(X)$.
Typically, a nonrandomized test can be obtained if the distribution of $Y$ is
continuous; otherwise UMP tests are randomized.


Example 6.7. Let $X_1, ..., X_n$ be i.i.d. binary random variables with $p = P(X_1 = 1)$. The p.d.f. of $X = (X_1, ..., X_n)$ is of the form (6.10) with $Y = \sum_{i=1}^n X_i$ and $\eta(p) = \log \frac{p}{1-p}$. Note that $\eta(p)$ is a strictly increasing function
of $p$. By Corollary 6.1, a UMP test for $H_0: p \le p_0$ versus $H_1: p > p_0$ is
given by (6.11), where $c$ and $\gamma$ are determined by (6.7) with $c = m$.


Example 6.8. Let $X_1, ..., X_n$ be i.i.d. random variables from the Poisson
distribution $P(\theta)$ with an unknown $\theta > 0$. The p.d.f. of $X = (X_1, ..., X_n)$
is of the form (6.10) with $Y(X) = \sum_{i=1}^n X_i$ and $\eta(\theta) = \log \theta$. Note that
$Y$ has the Poisson distribution $P(n\theta)$. By Corollary 6.1, a UMP test for
$H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ is given by (6.11) with $c$ and $\gamma$ satisfying

$$\alpha = \sum_{j=c+1}^{\infty} \frac{e^{-n\theta_0} (n\theta_0)^j}{j!} + \gamma \, \frac{e^{-n\theta_0} (n\theta_0)^c}{c!}.$$
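The constants $c$ and $\gamma$ in the Poisson case can be found by accumulating the Poisson($n\theta_0$) probabilities until the upper tail drops to $\alpha$ or below. This is a minimal sketch with a hypothetical function name, intended for moderate values of $n\theta_0$ (for very large means $e^{-n\theta_0}$ underflows).

```python
from math import exp


def poisson_ump_constants(n, theta0, alpha):
    """Find (c, gamma) with alpha = P(Y > c) + gamma * P(Y = c), Y ~ Poisson(n*theta0)."""
    mean = n * theta0
    pmf = exp(-mean)   # P(Y = 0)
    cdf = pmf          # P(Y <= 0)
    c = 0
    # Advance c to the smallest value with P(Y > c) <= alpha.
    while 1 - cdf > alpha:
        c += 1
        pmf *= mean / c
        cdf += pmf
    gamma = (alpha - (1 - cdf)) / pmf
    return c, gamma


c, gamma = poisson_ump_constants(n=5, theta0=1.0, alpha=0.05)
print(c, round(gamma, 3))
```

Because $Y$ is discrete, $\gamma$ is generally strictly between 0 and 1, so the UMP test is randomized on $\{Y = c\}$.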


Example 6.9. Let $X_1, ..., X_n$ be i.i.d. random variables from the uniform
distribution $U(0, \theta)$, $\theta > 0$. Consider the hypotheses $H_0: \theta \le \theta_0$ and
$H_1: \theta > \theta_0$. Since the p.d.f. of $X = (X_1, ..., X_n)$ is in a family with
monotone likelihood ratio in $Y(X) = X_{(n)}$ (Example 6.4), by Theorem
6.2, a UMP test is of the form (6.11). Since $X_{(n)}$ has the Lebesgue p.d.f.
$n\theta^{-n} x^{n-1} I_{(0,\theta)}(x)$, the UMP test in (6.11) is nonrandomized and

$$\alpha = \beta_{T_*}(\theta_0) = \frac{n}{\theta_0^n} \int_c^{\theta_0} x^{n-1} \, dx = 1 - \frac{c^n}{\theta_0^n}.$$

Hence $c = \theta_0 (1 - \alpha)^{1/n}$. The power function of $T_*$ when $\theta > \theta_0$ is

$$\beta_{T_*}(\theta) = \frac{n}{\theta^n} \int_c^{\theta} x^{n-1} \, dx = 1 - \frac{\theta_0^n (1 - \alpha)}{\theta^n}.$$

In this problem, however, UMP tests are not unique. (Note that the
condition $P_\theta(f_\theta(X) = c f_{\theta_0}(X)) = 0$ in Theorem 6.2(iv) is not satisfied.) It
can be shown (exercise) that the following test is also UMP with size $\alpha$:

$$T(X) = \begin{cases} 1 & X_{(n)} > \theta_0 \\ \alpha & X_{(n)} \le \theta_0. \end{cases}$$
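The non-uniqueness in Example 6.9 can be seen numerically: the nonrandomized test with cutoff $c = \theta_0(1-\alpha)^{1/n}$ and the alternative test that rejects with probability $\alpha$ whenever $X_{(n)} \le \theta_0$ have the same power for every $\theta \ge \theta_0$. The function names below are illustrative assumptions.

```python
def power_cutoff_test(theta, theta0, n, alpha):
    """Power of the test rejecting when X_(n) > c, with c = theta0 * (1 - alpha)^(1/n)."""
    c = theta0 * (1 - alpha) ** (1 / n)
    if theta <= c:
        return 0.0
    return 1 - (c / theta) ** n          # P_theta(X_(n) > c)


def power_alternative_test(theta, theta0, n, alpha):
    """Power of the test T = 1 if X_(n) > theta0 and T = alpha otherwise (theta >= theta0)."""
    below = (theta0 / theta) ** n        # P_theta(X_(n) <= theta0)
    return (1 - below) + alpha * below


for theta in (2.0, 2.5, 4.0):
    p1 = power_cutoff_test(theta, 2.0, 5, 0.1)
    p2 = power_alternative_test(theta, 2.0, 5, 0.1)
    print(theta, round(p1, 6), round(p2, 6))
```

Both functions return $1 - (1-\alpha)(\theta_0/\theta)^n$ for $\theta \ge \theta_0$, so neither test dominates the other on the alternative.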


6.1.3 UMP tests for two-sided hypotheses 


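The following hypotheses are called two-sided hypotheses:

$$H_0: \theta \le \theta_1 \text{ or } \theta \ge \theta_2 \quad \text{versus} \quad H_1: \theta_1 < \theta < \theta_2, \tag{6.12}$$
$$H_0: \theta_1 \le \theta \le \theta_2 \quad \text{versus} \quad H_1: \theta < \theta_1 \text{ or } \theta > \theta_2, \tag{6.13}$$
$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \ne \theta_0, \tag{6.14}$$

where $\theta_0$, $\theta_1$, and $\theta_2$ are given constants and $\theta_1 < \theta_2$.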


Theorem 6.3. Suppose that $X$ has the p.d.f. given by (6.10) w.r.t. a $\sigma$-finite measure, where $\eta$ is a strictly increasing function of $\theta$.
(i) For testing hypotheses (6.12), a UMP test of size $\alpha$ is

$$T_*(X) = \begin{cases} 1 & c_1 < Y(X) < c_2 \\ \gamma_i & Y(X) = c_i, \ i = 1, 2 \\ 0 & Y(X) < c_1 \text{ or } Y(X) > c_2, \end{cases} \tag{6.15}$$

where $c_i$'s and $\gamma_i$'s are determined by

$$\beta_{T_*}(\theta_1) = \beta_{T_*}(\theta_2) = \alpha. \tag{6.16}$$

(ii) The test defined by (6.15) minimizes $\beta_T(\theta)$ over all $\theta < \theta_1$, $\theta > \theta_2$, and
$T$ satisfying $\beta_T(\theta_1) = \beta_T(\theta_2) = \alpha$.

(iii) If $T_*$ and $T_{**}$ are two tests satisfying (6.15) and $\beta_{T_*}(\theta_1) = \beta_{T_{**}}(\theta_1)$ and
if the region $\{T_{**} = 1\}$ is to the right of $\{T_* = 1\}$, then $\beta_{T_*}(\theta) < \beta_{T_{**}}(\theta)$
for $\theta > \theta_1$ and $\beta_{T_*}(\theta) > \beta_{T_{**}}(\theta)$ for $\theta < \theta_1$. If both $T_*$ and $T_{**}$ satisfy
(6.15) and (6.16), then $T_* = T_{**}$ a.s. $\mathcal{P}$.

Proof. (i) The distribution of $Y$ has a p.d.f.

$$g_\theta(y) = \exp\{\eta(\theta) y - \xi(\theta)\} \tag{6.17}$$

(Theorem 2.1). Since $Y$ is sufficient for $\theta$, we only need to consider tests of
the form $T(Y)$. Let $\theta_1 < \theta_3 < \theta_2$. Consider the problem of testing $\theta = \theta_1$
or $\theta = \theta_2$ versus $\theta = \theta_3$. Clearly, $(\alpha, \alpha)$ is an interior point of the set of
all points $(\beta_T(\theta_1), \beta_T(\theta_2))$ as $T$ ranges over all tests of the form $T(Y)$. By
(6.17) and Lemma 6.2, there are constants $\tilde{c}_1$ and $\tilde{c}_2$ such that

$$T_*(Y) = \begin{cases} 1 & a_1 e^{b_1 Y} + a_2 e^{b_2 Y} < 1 \\ 0 & a_1 e^{b_1 Y} + a_2 e^{b_2 Y} > 1 \end{cases}$$

satisfies (6.16), where $a_i = \tilde{c}_i e^{\xi(\theta_3) - \xi(\theta_i)}$ and $b_i = \eta(\theta_i) - \eta(\theta_3)$, $i = 1, 2$.
Clearly $a_i$'s cannot both be $\le 0$. If one of the $a_i$'s is $\le 0$ and the other
is $> 0$, then $a_1 e^{b_1 Y} + a_2 e^{b_2 Y}$ is strictly monotone (since $b_1 < 0 < b_2$) and
$T_*$ or $1 - T_*$ is of the form (6.11), which has a strictly monotone power
function (Theorem 6.2) and, therefore, cannot satisfy (6.16). Thus, both
$a_i$'s are positive. Then, $T_*$ is of the form (6.15) (since $b_1 < 0 < b_2$) and it
follows from Proposition 6.1 that $T_*$ is UMP for testing $\theta = \theta_1$ or $\theta = \theta_2$
versus $\theta = \theta_3$. Since $T_*$ does not depend on $\theta_3$, it follows from Lemma 6.1
that $T_*$ is UMP for testing $\theta = \theta_1$ or $\theta = \theta_2$ versus $H_1$.

To show that $T_*$ is a UMP test of size $\alpha$ for testing $H_0$ versus $H_1$, it
remains to show that $\beta_{T_*}(\theta) \le \alpha$ for $\theta \le \theta_1$ or $\theta \ge \theta_2$. But this follows
from part (ii) of the theorem by comparing $T_*$ with the test $T(Y) \equiv \alpha$.

(ii) The proof is similar to that in (i) and is left as an exercise.

(iii) The first claim in (iii) follows from Lemma 6.4, since the function
$T_{**} - T_*$ has a single change of sign. The second claim in (iii) follows from
the first claim.


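Lemma 6.4. Suppose that $X$ has a p.d.f. in $\{f_\theta(x) : \theta \in \Theta\}$, a parametric
family of p.d.f.'s w.r.t. a single $\sigma$-finite measure $\nu$ on $\mathcal{R}$, where $\Theta \subset \mathcal{R}$.
Suppose that this family has monotone likelihood ratio in $X$. Let $\psi$ be a
function with a single change of sign.

(i) There exists $\theta_0 \in \Theta$ such that $E_\theta[\psi(X)] \le 0$ for $\theta < \theta_0$ and $E_\theta[\psi(X)] \ge 0$
for $\theta > \theta_0$, where $E_\theta$ is the expectation w.r.t. $f_\theta$.

(ii) Suppose that $f_\theta(x) > 0$ for all $x$ and $\theta$, that $f_{\theta_1}(x)/f_\theta(x)$ is strictly
increasing in $x$ for $\theta < \theta_1$, and that $\nu(\{x : \psi(x) \ne 0\}) > 0$. If $E_{\theta_0}[\psi(X)] = 0$, then $E_\theta[\psi(X)] < 0$ for $\theta < \theta_0$ and $E_\theta[\psi(X)] > 0$ for $\theta > \theta_0$.

Proof. (i) Suppose that there is an $x_0 \in \mathcal{R}$ such that $\psi(x) \le 0$ for $x < x_0$
and $\psi(x) \ge 0$ for $x > x_0$. Let $\theta_1 < \theta_2$. We first show that $E_{\theta_1}[\psi(X)] > 0$
implies $E_{\theta_2}[\psi(X)] \ge 0$. If $f_{\theta_2}(x_0)/f_{\theta_1}(x_0) = \infty$, then $f_{\theta_1}(x) = 0$ for $x \ge x_0$
and, therefore, $E_{\theta_1}[\psi(X)] \le 0$. Hence $f_{\theta_2}(x_0)/f_{\theta_1}(x_0) = c < \infty$. Then
$\psi(x) \ge 0$ on the set $A = \{x : f_{\theta_1}(x) = 0 \text{ and } f_{\theta_2}(x) > 0\}$. Thus,

$$\begin{aligned}
E_{\theta_2}[\psi(X)] &\ge \int_{A^c} \psi \, \frac{f_{\theta_2}}{f_{\theta_1}} \, f_{\theta_1} \, d\nu \\
&\ge \int_{x < x_0} c \psi f_{\theta_1} \, d\nu + \int_{x \ge x_0} c \psi f_{\theta_1} \, d\nu \\
&= c E_{\theta_1}[\psi(X)].
\end{aligned} \tag{6.18}$$

The result follows by letting $\theta_0 = \inf\{\theta : E_\theta[\psi(X)] > 0\}$.

(ii) Under the assumed conditions, $f_{\theta_2}(x_0)/f_{\theta_1}(x_0) = c < \infty$. The result
follows from the proof in (i) with $\theta_1$ replaced by $\theta_0$ and the fact that $\ge$
should be replaced by $>$ in (6.18) under the assumed conditions.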


Part (iii) of Theorem 6.3 shows that the $c_i$'s and $\gamma_i$'s are uniquely determined by (6.15) and (6.16). It also indicates how to select the $c_i$'s and
$\gamma_i$'s. One can start with some trial values $c_1^{(0)}$ and $\gamma_1^{(0)}$, find $c_2^{(0)}$ and $\gamma_2^{(0)}$
such that $\beta_{T_*}(\theta_1) = \alpha$, and compute $\beta_{T_*}(\theta_2)$. If $\beta_{T_*}(\theta_2) < \alpha$, by Theorem
6.3(iii), the correct rejection region $\{T_* = 1\}$ is to the right of the one
chosen so that one should try $c_1^{(1)} > c_1^{(0)}$ or $c_1^{(1)} = c_1^{(0)}$ and $\gamma_1^{(1)} < \gamma_1^{(0)}$; the
converse holds if $\beta_{T_*}(\theta_2) > \alpha$.


Example 6.10. Let $X_1, ..., X_n$ be i.i.d. from $N(\theta, 1)$. By Theorem 6.3, a
UMP test for testing (6.12) is $T_*(X) = I_{(c_1, c_2)}(\bar{X})$, where $c_i$'s are determined by

$$\Phi(\sqrt{n}(c_2 - \theta_1)) - \Phi(\sqrt{n}(c_1 - \theta_1)) = \alpha$$

and

$$\Phi(\sqrt{n}(c_2 - \theta_2)) - \Phi(\sqrt{n}(c_1 - \theta_2)) = \alpha.$$
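The two equations of Example 6.10 can be solved numerically. The sketch below relies on an observation not stated in the text: by the symmetry of the normal density, an interval centered at $(\theta_1 + \theta_2)/2$ that satisfies the first equation automatically satisfies the second, and by Theorem 6.3(iii) the solution is unique. It therefore bisects on the half-width $d$ only. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_normal_ump(theta1, theta2, n, alpha, tol=1e-12):
    """Solve for (c1, c2) in Example 6.10, assuming c1, c2 symmetric about the midpoint."""
    Phi = NormalDist().cdf
    mid = (theta1 + theta2) / 2
    delta = (theta2 - theta1) / 2
    root_n = sqrt(n)

    def size_at_theta1(d):
        # P_{theta1}(mid - d < Xbar < mid + d); increasing in d
        return Phi(root_n * (delta + d)) - Phi(root_n * (delta - d))

    lo, hi = 0.0, 1.0
    while size_at_theta1(hi) < alpha:   # widen the bracket if needed
        hi *= 2
    while hi - lo > tol:                # plain bisection on the half-width
        d = (lo + hi) / 2
        if size_at_theta1(d) < alpha:
            lo = d
        else:
            hi = d
    d = (lo + hi) / 2
    return mid - d, mid + d


c1, c2 = two_sided_normal_ump(theta1=-0.5, theta2=0.5, n=16, alpha=0.05)
print(round(c1, 4), round(c2, 4))
```

In non-symmetric problems (for example, discrete families) the trial-and-adjust procedure described after Lemma 6.4 would be needed instead of this shortcut.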


When the distribution of $X$ is not given by (6.10), UMP tests for hypotheses (6.12) exist in some cases (see Exercises 17 and 26). Unfortunately,
a UMP test does not exist in general for testing hypotheses (6.13) or (6.14)
(Exercises 28 and 29). A key reason for this phenomenon is that UMP tests
for testing one-sided hypotheses do not have level $\alpha$ for testing (6.12); but
they are of level $\alpha$ for testing (6.13) or (6.14) and there does not exist a
single test more powerful than all tests that are UMP for testing one-sided
hypotheses.




6.2 UMP Unbiased Tests 


When a UMP test does not exist, we may use the same approach used 
in estimation problems, i.e., imposing a reasonable restriction on the tests 
to be considered and finding optimal tests within the class of tests under 
the restriction. Two such types of restrictions in estimation problems are 
unbiasedness and invariance. We consider unbiased tests in this section. 
The class of invariant tests is studied in §6.3. 


6.2.1 Unbiasedness, similarity, and Neyman structure 


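A UMP test $T$ of size $\alpha$ has the property that

$$\beta_T(P) \le \alpha, \ P \in \mathcal{P}_0 \quad \text{and} \quad \beta_T(P) \ge \alpha, \ P \in \mathcal{P}_1. \tag{6.19}$$

This means that $T$ is at least as good as the silly test $T \equiv \alpha$. Thus, we
have the following definition.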


Definition 6.3. Let $\alpha$ be a given level of significance. A test $T$ for $H_0: P \in \mathcal{P}_0$ versus $H_1: P \in \mathcal{P}_1$ is said to be unbiased of level $\alpha$ if and only if
(6.19) holds. A test of size $\alpha$ is called a uniformly most powerful unbiased
(UMPU) test if and only if it is UMP within the class of unbiased tests of
level $\alpha$.


Since a UMP test is UMPU, the discussion of unbiasedness of tests is 
useful only when a UMP test does not exist. In a large class of problems 
for which a UMP test does not exist, there do exist UMPU tests. 

Suppose that U is a sufficient statistic for P € P. Then, similar to the 
search for a UMP test, we need to consider functions of U only in order to 
find a UMPU test, since, for any unbiased test T(X), E(T|U) is unbiased 
and has the same power function as T. 


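Throughout this section, we consider the following hypotheses:

$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_1, \tag{6.20}$$

where $\theta = \theta(P)$ is a functional from $\mathcal{P}$ onto $\Theta$ and $\Theta_0$ and $\Theta_1$ are two
disjoint Borel sets with $\Theta_0 \cup \Theta_1 = \Theta$. Note that $\mathcal{P}_j = \{P : \theta \in \Theta_j\}$,
$j = 0, 1$. For instance, $X_1, ..., X_n$ are i.i.d. from $F$ but we are interested in
testing $H_0: \theta \le 0$ versus $H_1: \theta > 0$, where $\theta = EX_1$ or the median of $F$.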


Definition 6.4. Consider the hypotheses specified by (6.20). Let $\alpha$ be a
given level of significance and let $\bar{\Theta}_{01}$ be the common boundary of $\Theta_0$ and
$\Theta_1$, i.e., the set of points $\theta$ that are points or limit points of both $\Theta_0$ and
$\Theta_1$. A test $T$ is similar on $\bar{\Theta}_{01}$ if and only if

$$\beta_T(P) = \alpha, \quad \theta \in \bar{\Theta}_{01}. \tag{6.21}$$




It is more convenient to work with (6.21) than to work with (6.19) when
the hypotheses are given by (6.20). Thus, the following lemma is useful. For
a given test $T$, the power function $\beta_T(P)$ is said to be continuous in $\theta$ if and
only if for any $\{\theta_j : j = 0, 1, 2, ...\} \subset \Theta$, $\theta_j \to \theta_0$ implies $\beta_T(P_j) \to \beta_T(P_0)$,
where $P_j \in \mathcal{P}$ satisfying $\theta(P_j) = \theta_j$, $j = 0, 1, ...$. Note that if $\beta_T$ is a function
of $\theta$, then this continuity property is simply the continuity of $\beta_T(\theta)$.


Lemma 6.5. Consider hypotheses (6.20). Suppose that, for every $T$,
$\beta_T(P)$ is continuous in $\theta$. If $T_*$ is uniformly most powerful among all tests
satisfying (6.21) and has size $\alpha$, then $T_*$ is a UMPU test.

Proof. Under the continuity assumption on $\beta_T$, the class of tests satisfying
(6.21) contains the class of tests satisfying (6.19). Since $T_*$ is uniformly at
least as powerful as the test $T \equiv \alpha$, $T_*$ is unbiased. Hence, $T_*$ is a UMPU
test.


Using Lemma 6.5, we can derive a UMPU test for testing hypotheses 
given by (6.13) or (6.14), when X has the p.d.f. (6.10) in a one-parameter 
exponential family. (Note that a UMP test does not exist in these cases.) 
We do not provide the details here, since the results for one-parameter 
exponential families are special cases of those in §6.2.2 for multiparameter 
exponential families. To prepare for the discussion in §6.2.2, we introduce 
the following result that simplifies (6.21) when there is a statistic sufficient
and complete for $P \in \bar{\mathcal{P}} = \{P : \theta(P) \in \bar{\Theta}_{01}\}$.


Let $U(X)$ be a sufficient statistic for $P \in \bar{\mathcal{P}}$ and let $\bar{\mathcal{P}}_U$ be the family of
distributions of $U$ as $P$ ranges over $\bar{\mathcal{P}}$. If $T$ is a test satisfying

$$E[T(X) | U] = \alpha \quad \text{a.s. } \bar{\mathcal{P}}_U, \tag{6.22}$$

then

$$E[T(X)] = E\{E[T(X) | U]\} = \alpha, \quad P \in \bar{\mathcal{P}},$$

i.e., $T$ is similar on $\bar{\Theta}_{01}$. A test satisfying (6.22) is said to have Neyman
structure w.r.t. $U$. If all tests similar on $\bar{\Theta}_{01}$ have Neyman structure w.r.t.
$U$, then working with (6.21) is the same as working with (6.22).


Lemma 6.6. Let $U(X)$ be a sufficient statistic for $P \in \bar{\mathcal{P}}$. Then a necessary and sufficient condition for all tests similar on $\bar{\Theta}_{01}$ to have Neyman
structure w.r.t. $U$ is that $U$ is boundedly complete for $P \in \bar{\mathcal{P}}$.

Proof. (i) Suppose first that $U$ is boundedly complete for $P \in \bar{\mathcal{P}}$. Let
$T(X)$ be a test similar on $\bar{\Theta}_{01}$. Then $E[T(X) - \alpha] = 0$ for all $P \in \bar{\mathcal{P}}$. From
the boundedness of $T(X)$, $E[T(X) | U]$ is bounded (Proposition 1.10). Since
$E\{E[T(X) | U] - \alpha\} = E[T(X) - \alpha] = 0$ for all $P \in \bar{\mathcal{P}}$, (6.22) holds.

(ii) Suppose now that $U$ is not boundedly complete for $P \in \bar{\mathcal{P}}$. Then
there is a function $h$ such that $|h(u)| \le C$, $E[h(U)] = 0$ for all $P \in \bar{\mathcal{P}}$, and
$h(U) \ne 0$ with positive probability for some $P \in \bar{\mathcal{P}}$. Let $T(X) = \alpha + c h(U)$,
where $c = \min\{\alpha, 1 - \alpha\}/C$. The result follows from the fact that $T$ is a
test similar on $\bar{\Theta}_{01}$ but does not have Neyman structure w.r.t. $U$.


6.2.2 UMPU tests in exponential families 


Suppose that the distribution of $X$ is in a multiparameter natural exponential family (§2.1.3) with the following p.d.f. w.r.t. a $\sigma$-finite measure:

$$f_{\theta,\varphi}(x) = \exp\{\theta Y(x) + \varphi^\tau U(x) - \zeta(\theta, \varphi)\}, \tag{6.23}$$

where $\theta$ is a real-valued parameter, $\varphi$ is a vector-valued parameter, and $Y$
(real-valued) and $U$ (vector-valued) are statistics. It follows from Theorem
2.1(i) that the p.d.f. of $(Y, U)$ (w.r.t. a $\sigma$-finite measure) is in a natural
exponential family of the form $\exp\{\theta y + \varphi^\tau u - \zeta(\theta, \varphi)\}$ and, given $U = u$,
the p.d.f. of the conditional distribution of $Y$ (w.r.t. a $\sigma$-finite measure $\nu_u$)
is in a natural exponential family of the form $\exp\{\theta y - \zeta_u(\theta)\}$.


Theorem 6.4. Suppose that the distribution of $X$ is in a multiparameter
natural exponential family given by (6.23).
(i) For testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, a UMPU test of size $\alpha$ is

$$T_*(Y, U) = \begin{cases} 1 & Y > c(U) \\ \gamma(U) & Y = c(U) \\ 0 & Y < c(U), \end{cases} \tag{6.24}$$

where $c(u)$ and $\gamma(u)$ are Borel functions determined by

$$E_{\theta_0}[T_*(Y, U) | U = u] = \alpha \tag{6.25}$$

for every $u$, and $E_{\theta_0}$ is the expectation w.r.t. $f_{\theta_0, \varphi}$.
(ii) For testing hypotheses (6.12), a UMPU test of size $\alpha$ is

$$T_*(Y, U) = \begin{cases} 1 & c_1(U) < Y < c_2(U) \\ \gamma_i(U) & Y = c_i(U), \ i = 1, 2, \\ 0 & Y < c_1(U) \text{ or } Y > c_2(U), \end{cases} \tag{6.26}$$

where $c_i(u)$'s and $\gamma_i(u)$'s are Borel functions determined by

$$E_{\theta_1}[T_*(Y, U) | U = u] = E_{\theta_2}[T_*(Y, U) | U = u] = \alpha \tag{6.27}$$

for every $u$.
(iii) For testing hypotheses (6.13), a UMPU test of size $\alpha$ is

$$T_*(Y, U) = \begin{cases} 1 & Y < c_1(U) \text{ or } Y > c_2(U) \\ \gamma_i(U) & Y = c_i(U), \ i = 1, 2, \\ 0 & c_1(U) < Y < c_2(U), \end{cases} \tag{6.28}$$

where $c_i(u)$'s and $\gamma_i(u)$'s are Borel functions determined by (6.27) for every
$u$.
(iv) For testing hypotheses (6.14), a UMPU test of size $\alpha$ is given by (6.28),
where $c_i(u)$'s and $\gamma_i(u)$'s are Borel functions determined by (6.25) and

$$E_{\theta_0}[T_*(Y, U) Y | U = u] = \alpha E_{\theta_0}(Y | U = u) \tag{6.29}$$

for every $u$.

Proof. Since $(Y, U)$ is sufficient for $(\theta, \varphi)$, we only need to consider tests
that are functions of $(Y, U)$. Hypotheses in (i)-(iv) are of the form (6.20)
with $\bar{\Theta}_{01} = \{(\theta, \varphi) : \theta = \theta_0\}$ or $= \{(\theta, \varphi) : \theta = \theta_i, \ i = 1, 2\}$. In case (i) or
(iv), $U$ is sufficient and complete for $P \in \bar{\mathcal{P}}$ and, hence, Lemma 6.6 applies.
In case (ii) or (iii), applying Lemma 6.6 to each $\{(\theta, \varphi) : \theta = \theta_i\}$ also shows
that working with (6.21) is the same as working with (6.22). By Theorem
2.1, the power functions of all tests are continuous and, hence, Lemma 6.5
applies. Thus, for (i)-(iii), we only need to show that $T_*$ is UMP among all
tests $T$ satisfying (6.25) (for part (i)) or (6.27) (for part (ii) or (iii)) with
$T_*$ replaced by $T$. For (iv), any unbiased $T$ should satisfy (6.25) with $T_*$
replaced by $T$ and

$$\frac{\partial}{\partial \theta} E_{\theta,\varphi}[T(Y, U)] = 0, \quad \theta \in \bar{\Theta}_{01}. \tag{6.30}$$

By Theorem 2.1, the differentiation can be carried out under the expectation sign. Hence, one can show (exercise) that (6.30) is equivalent to

$$E_{\theta,\varphi}[T(Y, U) Y - \alpha Y] = 0, \quad \theta \in \bar{\Theta}_{01}. \tag{6.31}$$

Using the argument in the proof of Lemma 6.6, one can show (exercise)
that (6.31) is equivalent to (6.29) with $T_*$ replaced by $T$. Hence, to prove
(iv) we only need to show that $T_*$ is UMP among all tests $T$ satisfying
(6.25) and (6.29) with $T_*$ replaced by $T$.


Note that the power function of any test $T(Y, U)$ is

$$\beta_T(\theta, \varphi) = \int \left[ \int T(y, u) \, dP_{Y|U=u}(y) \right] dP_U(u).$$

Thus, it suffices to show that for every fixed $u$ and $\theta \in \Theta_1$, $T_*$ maximizes

$$\int T(y, u) \, dP_{Y|U=u}(y)$$

over all $T$ subject to the given side conditions. Since $P_{Y|U=u}$ is in a
one-parameter exponential family, the results in (i) and (ii) follow from
Corollary 6.1 and Theorem 6.3, respectively. The result in (iii) follows
from Theorem 6.3(ii) by considering $1 - T_*$ with $T_*$ given by (6.15). To
prove the result in (iv), it suffices to show that if $Y$ has the p.d.f. given
by (6.10) and if $U$ is treated as a constant in (6.25), (6.28), and (6.29), $T_*$
in (6.28) is UMP subject to conditions (6.25) and (6.29). We now omit
$U$ in the following proof for (iv), which is very similar to the proof of
Theorem 6.3. First, $(\alpha, \alpha E_{\theta_0}(Y))$ is an interior point of the set of points
$(E_{\theta_0}[T(Y)], E_{\theta_0}[T(Y) Y])$ as $T$ ranges over all tests of the form $T(Y)$ (exercise). By Lemma 6.2 and Proposition 6.1, for testing $\theta = \theta_0$ versus $\theta = \theta_1$,
the UMP test is equal to 1 when

$$(k_1 + k_2 y) e^{\theta_0 y} < C(\theta_0, \theta_1) e^{\theta_1 y}, \tag{6.32}$$

where $k_i$'s and $C(\theta_0, \theta_1)$ are constants. Note that (6.32) is equivalent to

$$a_1 + a_2 y < e^{by}$$

for some constants $a_1$, $a_2$, and $b$. This region is either one-sided or the
outside of an interval. By Theorem 6.2(ii), a one-sided test has a strictly
monotone power function and therefore cannot satisfy (6.29). Thus, this
test must have the form (6.28). Since $T_*$ in (6.28) does not depend on
$\theta_1$, by Lemma 6.1, it is UMP over all tests satisfying (6.25) and (6.29); in
particular, the test $\equiv \alpha$. Thus, $T_*$ is UMPU.

Finally, it can be shown that all the $c$- and $\gamma$-functions in (i)-(iv) are
Borel functions (see Lehmann (1986, p. 149)).


Example 6.11. A problem arising in many different contexts is the com- 
parison of two treatments. If the observations are integer-valued, the prob- 
lem often reduces to testing the equality of two Poisson distributions (e.g., 
a comparison of the radioactivity of two substances or the car accident rate 
in two cities) or two binomial distributions (when the observation is the 
number of successes in a sequence of trials for each treatment). 


Consider first the Poisson problem in which $X_1$ and $X_2$ are independently distributed as the Poisson distributions $P(\lambda_1)$ and $P(\lambda_2)$, respectively. The p.d.f. of $X = (X_1, X_2)$ is

$$\frac{e^{-(\lambda_1 + \lambda_2)}}{x_1! \, x_2!} \exp\{x_2 \log(\lambda_2/\lambda_1) + (x_1 + x_2) \log \lambda_1\} \tag{6.33}$$

w.r.t. the counting measure on $\{(i, j) : i = 0, 1, 2, ..., \ j = 0, 1, 2, ...\}$. Let $\theta = \log(\lambda_2/\lambda_1)$. Then hypotheses such as $\lambda_1 = \lambda_2$ and $\lambda_1 \ge \lambda_2$ are equivalent to
$\theta = 0$ and $\theta \le 0$, respectively. The p.d.f. in (6.33) is of the form (6.23) with
$\varphi = \log \lambda_1$, $Y = X_2$, and $U = X_1 + X_2$. Thus, Theorem 6.4 applies. To
obtain various tests in Theorem 6.4, it is enough to derive the conditional
distribution of $Y = X_2$ given $U = X_1 + X_2 = u$. Using the fact that
$X_1 + X_2$ has the Poisson distribution $P(\lambda_1 + \lambda_2)$, one can show that

$$P(Y = y | U = u) = \binom{u}{y} p^y (1 - p)^{u - y} I_{\{0, 1, ..., u\}}(y), \quad u = 0, 1, 2, ...,$$

where $p = \lambda_2/(\lambda_1 + \lambda_2) = e^\theta/(1 + e^\theta)$. This is the binomial distribution $Bi(p, u)$. On the boundary set $\bar{\Theta}_{01}$, $\theta = \theta_j$ (a known value) and the
distribution $P_{Y|U=u}$ is known.
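The claim that $Y = X_2$ given $U = u$ is $Bi(p, u)$ with $p = \lambda_2/(\lambda_1 + \lambda_2)$ can be checked directly from the Poisson probabilities. The sketch below uses arbitrary illustrative values of $\lambda_1$, $\lambda_2$, and $u$; the helper names are assumptions.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)


def conditional_pmf(y, u, lam1, lam2):
    """P(X2 = y | X1 + X2 = u) for independent X1 ~ P(lam1), X2 ~ P(lam2)."""
    joint = poisson_pmf(u - y, lam1) * poisson_pmf(y, lam2)
    marginal = poisson_pmf(u, lam1 + lam2)   # X1 + X2 ~ P(lam1 + lam2)
    return joint / marginal


lam1, lam2, u = 2.0, 3.5, 6
p = lam2 / (lam1 + lam2)
max_gap = max(
    abs(conditional_pmf(y, u, lam1, lam2) - comb(u, y) * p**y * (1 - p) ** (u - y))
    for y in range(u + 1)
)
print(max_gap)
```

The largest discrepancy is at the level of floating-point rounding. Under $\theta = 0$, $p = 1/2$, so the conditional test reduces to a binomial test with success probability $1/2$.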


The previous result can obviously be extended to the case where two
independent samples, $X_{i1}, ..., X_{in_i}$, $i = 1, 2$, are i.i.d. from the Poisson
distributions $P(\lambda_i)$, $i = 1, 2$, respectively.

Consider next the binomial problem in which $X_j$, $j = 1, 2$, are independently distributed as the binomial distributions $Bi(p_j, n_j)$, $j = 1, 2$,
respectively, where $n_j$'s are known but $p_j$'s are unknown. The p.d.f. of
$X = (X_1, X_2)$ is

$$\binom{n_1}{x_1} \binom{n_2}{x_2} (1 - p_1)^{n_1} (1 - p_2)^{n_2} \exp\left\{ x_2 \log \frac{p_2 (1 - p_1)}{p_1 (1 - p_2)} + (x_1 + x_2) \log \frac{p_1}{1 - p_1} \right\}$$

w.r.t. the counting measure on $\{(i, j) : i = 0, 1, ..., n_1, \ j = 0, 1, ..., n_2\}$. This
p.d.f. is of the form (6.23) with $\theta = \log \frac{p_2 (1 - p_1)}{p_1 (1 - p_2)}$, $Y = X_2$, and $U = X_1 + X_2$.
Thus, Theorem 6.4 applies. Note that hypotheses such as $p_1 = p_2$ and
$p_1 \ge p_2$ are equivalent to $\theta = 0$ and $\theta \le 0$, respectively. Using the joint
distribution of $(X_1, X_2)$, one can show (exercise) that

$$P(Y = y | U = u) = K_u(\theta) \binom{n_1}{u - y} \binom{n_2}{y} e^{\theta y} I_A(y), \quad u = 0, 1, ..., n_1 + n_2,$$

where $A = \{y : y = 0, 1, ..., \min\{u, n_2\}, \ u - y \le n_1\}$ and

$$K_u(\theta) = \left[ \sum_{y \in A} \binom{n_1}{u - y} \binom{n_2}{y} e^{\theta y} \right]^{-1}. \tag{6.34}$$

If $\theta = 0$, this distribution reduces to a known distribution: the hypergeometric distribution $HG(u, n_2, n_1)$ (Table 1.1, page 18).
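The conditional distribution in the binomial problem, with normalizing constant $K_u(\theta)$ from (6.34), can be tabulated directly. The sketch below (function name and numerical values are illustrative assumptions) also confirms that $\theta = 0$ gives hypergeometric probabilities.

```python
from math import comb, exp


def conditional_pmf(u, n1, n2, theta):
    """P(Y = y | U = u) on the set A, proportional to C(n1, u-y) C(n2, y) e^{theta y}."""
    support = [y for y in range(0, min(u, n2) + 1) if u - y <= n1]
    weights = [comb(n1, u - y) * comb(n2, y) * exp(theta * y) for y in support]
    k_u = 1.0 / sum(weights)               # K_u(theta) in (6.34)
    return {y: k_u * w for y, w in zip(support, weights)}


pmf0 = conditional_pmf(u=5, n1=6, n2=4, theta=0.0)
# Hypergeometric probabilities: y successes from n2 = 4 when drawing u = 5 from n1 + n2 = 10.
hyper = {y: comb(4, y) * comb(6, 5 - y) / comb(10, 5) for y in pmf0}
print(all(abs(pmf0[y] - hyper[y]) < 1e-12 for y in pmf0))
```

For $\theta \ne 0$ the same function gives the tilted distribution needed to evaluate the conditional power of the tests in Theorem 6.4.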


Example 6.12 ($2 \times 2$ contingency tables). Let $A$ and $B$ be two different
events in a probability space related to a random experiment. Suppose that
$n$ independent trials of the experiment are carried out and that we observe
the frequencies of the occurrence of the events $A \cap B$, $A \cap B^c$, $A^c \cap B$, and
$A^c \cap B^c$. The results can be summarized in the following $2 \times 2$ contingency
table:




The distribution of $X = (X_{11}, X_{12}, X_{21}, X_{22})$ is multinomial (Example 2.7)
with probabilities $p_{11}$, $p_{12}$, $p_{21}$, and $p_{22}$, where $p_{ij} = E(X_{ij})/n$. Thus, the
p.d.f. of $X$ is

$$\frac{n!}{x_{11}! \, x_{12}! \, x_{21}! \, x_{22}!} \, p_{22}^n \exp\left\{ x_{11} \log \frac{p_{11}}{p_{22}} + x_{12} \log \frac{p_{12}}{p_{22}} + x_{21} \log \frac{p_{21}}{p_{22}} \right\}$$

w.r.t. the counting measure on the range of $X$. This p.d.f. is clearly of the
form (6.23). By Theorem 6.4, we can derive UMPU tests for any parameter
of the form

$$\theta = a_0 \log \frac{p_{11}}{p_{22}} + a_1 \log \frac{p_{12}}{p_{22}} + a_2 \log \frac{p_{21}}{p_{22}},$$

where $a_i$'s are given constants. In particular, testing independence of $A$
and $B$ is equivalent to the hypotheses $H_0: \theta = 0$ versus $H_1: \theta \ne 0$ when
$a_0 = 1$ and $a_1 = a_2 = -1$ (exercise).

For hypotheses concerning $\theta$ with $a_0 = 1$ and $a_1 = a_2 = -1$, the p.d.f. of
$X$ can be written as (6.23) with $Y = X_{11}$ and $U = (X_{11} + X_{12}, X_{11} + X_{21})$.
A direct calculation shows that $P(Y = y | X_{11} + X_{12} = n_1, X_{11} + X_{21} = m_1)$
is equal to

$$K_{m_1}(\theta) \binom{n_1}{y} \binom{n_2}{m_1 - y} e^{\theta(m_1 - y)} I_A(y),$$

where $A = \{y : y = 0, 1, ..., \min\{m_1, n_1\}, \ m_1 - y \le n_2\}$ and $K_u(\theta)$ is
given by (6.34). This distribution is known when $\theta = \theta_j$ is known. In
particular, for testing independence of $A$ and $B$, $\theta = 0$ implies that $P_{Y|U=u}$
is the hypergeometric distribution $HG(m_1, n_1, n_2)$, and the UMPU test in
Theorem 6.4(iv) is also known as Fisher's exact test.


Suppose that $X_{ij}$'s in the $2 \times 2$ contingency table are from two binomial
distributions, i.e., $X_{i1}$ is from the binomial distribution $Bi(p_i, n_i)$, $X_{i2} = n_i - X_{i1}$, $i = 1, 2$, and that $X_{i1}$'s are independent. Then the UMPU test
for independence of $A$ and $B$ previously derived is exactly the same as the
UMPU test for $p_1 = p_2$ given in Example 6.11. The only difference is that
$n_i$'s are fixed for testing the equality of two binomial distributions, whereas
$n_i$'s are random for testing independence of $A$ and $B$. This is also true for
the general $r \times c$ contingency tables considered in §6.4.3.


6.2.3 UMPU tests in normal families 


An important application of Theorem 6.4 to problems with continuous dis- 
tributions in exponential families is the derivation of UMPU tests in normal 
families. The results presented here are the basic justifications for tests in 
elementary textbooks concerning parameters in normal families. 


We start with the following lemma, which is useful especially when X 
is from a population in a normal family. 




Lemma 6.7. Suppose that $X$ has the p.d.f. (6.23) and that $V(Y, U)$ is a
statistic independent of $U$ when $\theta = \theta_j$, where $\theta_j$'s are known values given
in the hypotheses in (i)-(iv) of Theorem 6.4.

(i) If $V(y, u)$ is increasing in $y$ for each $u$, then the UMPU tests in (i)-(iii)
of Theorem 6.4 are equivalent to those given by (6.24)-(6.28) with $Y$ and
$(Y, U)$ replaced by $V$ and with $c_i(U)$ and $\gamma_i(U)$ replaced by constants $c_i$
and $\gamma_i$, respectively.

(ii) If there are Borel functions $a(u) > 0$ and $b(u)$ such that $V(y, u) = a(u) y + b(u)$, then the UMPU test in Theorem 6.4(iv) is equivalent to that
given by (6.25), (6.28), and (6.29) with $Y$ and $(Y, U)$ replaced by $V$ and
with $c_i(U)$ and $\gamma_i(U)$ replaced by constants $c_i$ and $\gamma_i$, respectively.

Proof. (i) Since $V$ is increasing in $y$, $Y > c_i(u)$ is equivalent to $V > d_i(u)$
for some $d_i$. The result follows from the fact that $V$ is independent of $U$ so
that $d_i$'s and $\gamma_i$'s do not depend on $u$ when $Y$ is replaced by $V$.

(ii) Since $V = a(U) Y + b(U)$, the UMPU test in Theorem 6.4(iv) is the
same as

$$T_*(V, U) = \begin{cases} 1 & V < c_1(U) \text{ or } V > c_2(U) \\ \gamma_i(U) & V = c_i(U), \ i = 1, 2, \\ 0 & c_1(U) < V < c_2(U), \end{cases} \tag{6.35}$$

subject to $E_{\theta_0}[T_*(V, U) | U = u] = \alpha$ and

$$E_{\theta_0}\left[ T_*(V, U) \, \frac{V - b(U)}{a(U)} \,\Big|\, U \right] = \alpha E_{\theta_0}\left[ \frac{V - b(U)}{a(U)} \,\Big|\, U \right]. \tag{6.36}$$

Under $E_{\theta_0}[T_*(V, U) | U = u] = \alpha$, (6.36) is the same as $E_{\theta_0}[T_*(V, U) V | U] = \alpha E_{\theta_0}(V | U)$. Since $V$ and $U$ are independent when $\theta = \theta_0$, $c_i(u)$'s and
$\gamma_i(u)$'s do not depend on $u$ and, therefore, $T_*$ in (6.35) does not depend on
$U$.


If the conditions of Lemma 6.7 are satisfied, then UMPU tests can
be derived by working with the distribution of $V$ instead of $P_{Y|U=u}$. In
exponential families, a $V(Y, U)$ independent of $U$ can often be found by
applying Basu's theorem (Theorem 2.4).

When we consider normal families, $\gamma_i$'s can be chosen to be 0 since the
c.d.f. of $Y$ given $U = u$ or the c.d.f. of $V$ is continuous.


One-sample problems 


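Let $X_1,...,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with unknown $\mu \in \mathcal{R}$ and $\sigma^2 > 0$,
where $n \ge 2$. The joint p.d.f. of $X = (X_1,...,X_n)$ is

$$\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \frac{n\mu}{\sigma^2}\bar{x} - \frac{n\mu^2}{2\sigma^2} \right\}.$$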




Consider first hypotheses concerning $\sigma^2$. The p.d.f. of $X$ has the form
(6.23) with $\theta = -(2\sigma^2)^{-1}$, $\varphi = n\mu/\sigma^2$, $Y = \sum_{i=1}^n X_i^2$, and $U = \bar{X}$. By
Basu's theorem, $V = (n-1)S^2$ is independent of $U = \bar{X}$ (Example 2.18),
where $S^2$ is the sample variance. Also,

$$\sum_{i=1}^n X_i^2 = (n-1)S^2 + n\bar{X}^2,$$

i.e., $V = Y - nU^2$, which is linear and increasing in $Y$. Hence the conditions
of Lemma 6.7 are satisfied. Since $V/\sigma^2$ has the chi-square distribution
$\chi^2_{n-1}$ (Example 2.18), values of $c_i$'s for hypotheses in (i)-(iii) of
Theorem 6.4 are related to quantiles of $\chi^2_{n-1}$. For testing $H_0: \theta = \theta_0$
versus $H_1: \theta \ne \theta_0$ (which is equivalent to testing $H_0: \sigma^2 = \sigma_0^2$ versus
$H_1: \sigma^2 \ne \sigma_0^2$), $d_i = c_i/\sigma_0^2$, $i = 1,2$, are determined by

$$\int_{d_1}^{d_2} f_{n-1}(v)\,dv = 1-\alpha \quad \text{and} \quad \int_{d_1}^{d_2} v f_{n-1}(v)\,dv = (n-1)(1-\alpha),$$

where $f_m$ is the Lebesgue p.d.f. of the chi-square distribution $\chi^2_m$. Since
$v f_{n-1}(v) = (n-1) f_{n+1}(v)$, $d_1$ and $d_2$ are determined by

$$\int_{d_1}^{d_2} f_{n-1}(v)\,dv = \int_{d_1}^{d_2} f_{n+1}(v)\,dv = 1-\alpha.$$

If $n-1 \approx n+1$, then $d_1$ and $d_2$ are nearly the $(\alpha/2)$th and $(1-\alpha/2)$th
quantiles of $\chi^2_{n-1}$, respectively, in which case the UMPU test in Theorem
6.4(iv) is the same as the "equal-tailed" chi-square test for $H_0$ in elementary
textbooks.
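For small $n$ the constants $d_1$ and $d_2$ differ from the equal-tailed quantiles and have to be found numerically. The following Python sketch is one possible way to do this with the standard library only; it is not part of the original text. The function names (`chi2_cdf`, `chi2_quantile`, `umpu_chi2_cutoffs`), the series used for the chi-square c.d.f., and the bisection scheme are all illustrative assumptions, and the series is only intended for moderate arguments.

```python
import math

def chi2_cdf(x, m):
    """C.d.f. of chi-square with m degrees of freedom, via the power series
    of the regularized lower incomplete gamma function (moderate x only)."""
    if x <= 0:
        return 0.0
    a, z = m / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, m):
    """Quantile of chi-square with m degrees of freedom, by bisection."""
    lo, hi = 0.0, float(m)
    while chi2_cdf(hi, m) < p:
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, m) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def umpu_chi2_cutoffs(n, alpha):
    """Search over the lower-tail probability p so that (d1, d2) has mass
    1 - alpha under both chi-square(n-1) and chi-square(n+1)."""
    lo, hi = 0.0, alpha
    for _ in range(50):
        p = 0.5 * (lo + hi)
        d1 = chi2_quantile(p, n - 1)               # mass p below d1
        d2 = chi2_quantile(1 - alpha + p, n - 1)   # mass 1 - alpha on (d1, d2)
        excess = chi2_cdf(d2, n + 1) - chi2_cdf(d1, n + 1) - (1 - alpha)
        if excess < 0:
            lo = p      # interval sits too far left for chi-square(n+1)
        else:
            hi = p
    return d1, d2

d1, d2 = umpu_chi2_cutoffs(n=10, alpha=0.05)
print(d1, d2)
```

The first condition holds by construction for every candidate $p$, so the search is one-dimensional; the UMPU test then rejects $H_0$ when $(n-1)S^2/\sigma_0^2 < d_1$ or $> d_2$.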


Consider next hypotheses concerning $\mu$. The p.d.f. of $X$ has the form
(6.23) with $Y = \bar{X}$, $U = \sum_{i=1}^n (X_i - \mu_0)^2$, $\theta = n(\mu - \mu_0)/\sigma^2$, and $\varphi =
-(2\sigma^2)^{-1}$. For testing hypotheses $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$, we take
$V$ to be $t(X) = \sqrt{n}(\bar{X} - \mu_0)/S$. By Basu's theorem, $t(X)$ is independent
of $U$ when $\mu = \mu_0$. Hence it satisfies the conditions in Lemma 6.7(i). From
Examples 1.16 and 2.18, $t(X)$ has the t-distribution $t_{n-1}$ when $\mu = \mu_0$.
Thus, $c(U)$ in Theorem 6.4(i) is the $(1-\alpha)$th quantile of $t_{n-1}$. For the
two-sided hypotheses $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$, the statistic $V =
(\bar{X} - \mu_0)/\sqrt{U}$ satisfies the conditions in Lemma 6.7(ii) and has a distribution
symmetric about 0 when $\mu = \mu_0$. Then the UMPU test in Theorem 6.4(iv)
rejects $H_0$ when $|V| > d$, where $d$ satisfies $P(|V| > d) = \alpha$ when $\mu = \mu_0$.
Since

$$t(X) = \sqrt{(n-1)n}\,V(X) \Big/ \sqrt{1 - n[V(X)]^2},$$

the UMPU test rejects $H_0$ if and only if $|t(X)| > t_{n-1,\alpha/2}$, where $t_{n-1,\alpha}$ is
the $(1-\alpha)$th quantile of the t-distribution $t_{n-1}$. The UMPU tests derived
here are the so-called one-sample t-tests in elementary textbooks.
The power function of a one-sample t-test is related to the noncentral 
t-distribution introduced in §1.3.1 (see Exercise 36). 
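As a numerical sanity check on the identity relating $t(X)$ to $V = (\bar{X} - \mu_0)/\sqrt{U}$, the following Python sketch computes both sides on a small data set. The data values and the helper name `one_sample_stats` are made up for illustration and are not from the text.

```python
import math

def one_sample_stats(x, mu0):
    """Return t(X) = sqrt(n)(Xbar - mu0)/S and V = (Xbar - mu0)/sqrt(U),
    where U = sum_i (X_i - mu0)^2."""
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # sample variance S^2
    u = sum((xi - mu0) ** 2 for xi in x)               # U
    t = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)
    v = (xbar - mu0) / math.sqrt(u)
    return t, v

x = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2]   # hypothetical observations
n = len(x)
t, v = one_sample_stats(x, mu0=5.0)

# Right-hand side of the identity: sqrt((n-1)n) V / sqrt(1 - n V^2)
t_from_v = math.sqrt((n - 1) * n) * v / math.sqrt(1 - n * v * v)
print(t, t_from_v)
```

Because $t$ is a strictly increasing function of $V$ on $|V| < n^{-1/2}$, rejecting for large $|V|$ is the same as rejecting for large $|t(X)|$.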




Two-sample problems 


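The problem of comparing the parameters of two normal distributions arises
in the comparison of two treatments, products, and so on (see also Example
6.11). Suppose that we have two independent samples, $X_{i1},...,X_{in_i}$, $i =
1,2$, i.i.d. from $N(\mu_i,\sigma_i^2)$, $i = 1,2$, respectively, where $n_i \ge 2$. The joint
p.d.f. of $X_{ij}$'s is

$$C(\mu_1,\mu_2,\sigma_1^2,\sigma_2^2) \exp\left\{ -\sum_{i=1}^2 \frac{1}{2\sigma_i^2}\sum_{j=1}^{n_i} x_{ij}^2 + \sum_{i=1}^2 \frac{n_i\mu_i}{\sigma_i^2}\bar{x}_i \right\},$$

where $\bar{x}_i$ is the sample mean based on $x_{i1},...,x_{in_i}$ and $C(\cdot)$ is a known
function.

Consider first the hypothesis $H_0: \sigma_2^2/\sigma_1^2 \le \Delta_0$ or $H_0: \sigma_2^2/\sigma_1^2 = \Delta_0$.
The p.d.f. of $X_{ij}$'s is of the form (6.23) with

$$\theta = \frac{1}{2\Delta_0\sigma_1^2} - \frac{1}{2\sigma_2^2}, \qquad \varphi = \left( -\frac{1}{2\sigma_1^2},\ \frac{n_1\mu_1}{\sigma_1^2},\ \frac{n_2\mu_2}{\sigma_2^2} \right),$$

$$Y = \sum_{j=1}^{n_2} X_{2j}^2, \qquad U = \left( \sum_{j=1}^{n_1} X_{1j}^2 + \frac{1}{\Delta_0}\sum_{j=1}^{n_2} X_{2j}^2,\ \bar{X}_1,\ \bar{X}_2 \right).$$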


To apply Lemma 6.7, consider

$$V = \frac{(n_2-1)S_2^2/\Delta_0}{(n_1-1)S_1^2 + (n_2-1)S_2^2/\Delta_0} = \frac{(Y - n_2U_3^2)/\Delta_0}{U_1 - n_1U_2^2 - n_2U_3^2/\Delta_0},$$

where $S_i^2$ is the sample variance based on $X_{i1},...,X_{in_i}$ and $U_j$ is the $j$th
component of $U$. By Basu's theorem, $V$ and $U$ are independent when
$\theta = 0$ ($\sigma_2^2 = \Delta_0\sigma_1^2$). Since $V$ is increasing and linear in $Y$, the
conditions of Lemma 6.7 are satisfied. Thus, a UMPU test rejects $H_0: \theta \le 0$
(which is equivalent to $H_0: \sigma_2^2/\sigma_1^2 \le \Delta_0$) when $V > c_0$, where $c_0$ satisfies
$P(V > c_0) = \alpha$ when $\theta = 0$; and a UMPU test rejects $H_0: \theta = 0$ (which is
equivalent to $H_0: \sigma_2^2/\sigma_1^2 = \Delta_0$) when $V < c_1$ or $V > c_2$, where $c_i$'s satisfy
$P(c_1 < V < c_2) = 1-\alpha$ and $E[VT_*(V)] = \alpha E(V)$ when $\theta = 0$. Note that

$$V = \frac{(n_2-1)F}{n_1 - 1 + (n_2-1)F} \quad \text{with} \quad F = \frac{S_2^2/\Delta_0}{S_1^2}.$$

It follows from Example 1.16 that $F$ has the F-distribution $F_{n_2-1,n_1-1}$
(Table 1.2, page 20) when $\theta = 0$. Since $V$ is a strictly increasing function of
$F$, a UMPU test rejects $H_0: \theta \le 0$ when $F > F_{n_2-1,n_1-1,\alpha}$, where $F_{a,b,\alpha}$
is the $(1-\alpha)$th quantile of the F-distribution $F_{a,b}$. This is the F-test in
elementary textbooks.
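The relation between $V$ and $F$ is easy to confirm numerically. The short Python sketch below (an illustration only; the data, the value of $\Delta_0$, and the helper name `variance_ratio_stats` are assumptions) computes $V$ from its definition and compares it with $(n_2-1)F/[n_1-1+(n_2-1)F]$.

```python
def variance_ratio_stats(x1, x2, delta0):
    """Return (V, F) for the two-sample variance-ratio problem."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)   # (n1 - 1) * S_1^2
    ss2 = sum((v - m2) ** 2 for v in x2)   # (n2 - 1) * S_2^2
    V = (ss2 / delta0) / (ss1 + ss2 / delta0)
    F = (ss2 / (n2 - 1) / delta0) / (ss1 / (n1 - 1))
    return V, F

x1 = [1.0, 2.0, 3.0]          # hypothetical first sample
x2 = [2.0, 4.0, 6.0, 8.0]     # hypothetical second sample
n1, n2 = len(x1), len(x2)
V, F = variance_ratio_stats(x1, x2, delta0=2.0)

# V expressed as a strictly increasing function of F
V_from_F = (n2 - 1) * F / (n1 - 1 + (n2 - 1) * F)
print(V, V_from_F, F)
```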




When $\theta = 0$, $V$ has the beta distribution $B((n_2-1)/2, (n_1-1)/2)$ and
$E(V) = (n_2-1)/(n_1+n_2-2)$ (Table 1.2). Then, $E[VT_*(V)] = \alpha E(V)$
when $\theta = 0$ is the same as

$$\frac{(1-\alpha)(n_2-1)}{n_1+n_2-2} = \int_{c_1}^{c_2} v f_{(n_2-1)/2,(n_1-1)/2}(v)\,dv,$$

where $f_{a,b}$ is the p.d.f. of the beta distribution $B(a,b)$. Using the fact that
$v f_{(n_2-1)/2,(n_1-1)/2}(v) = (n_1+n_2-2)^{-1}(n_2-1) f_{(n_2+1)/2,(n_1-1)/2}(v)$, we
conclude that a UMPU test rejects $H_0: \theta = 0$ when $V < c_1$ or $V > c_2$,
where $c_1$ and $c_2$ are determined by

$$1-\alpha = \int_{c_1}^{c_2} f_{(n_2-1)/2,(n_1-1)/2}(v)\,dv = \int_{c_1}^{c_2} f_{(n_2+1)/2,(n_1-1)/2}(v)\,dv.$$

If $n_2 - 1 \approx n_2 + 1$ (i.e., $n_2$ is large), then this UMPU test can be
approximated by the F-test that rejects $H_0: \theta = 0$ if and only if $F <
F_{n_2-1,n_1-1,1-\alpha/2}$ or $F > F_{n_2-1,n_1-1,\alpha/2}$.

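Consider next the hypothesis $H_0: \mu_1 \ge \mu_2$ or $H_0: \mu_1 = \mu_2$. If $\sigma_1^2 \ne \sigma_2^2$,
the problem is the so-called Behrens-Fisher problem and is not accessible by
the method introduced in this section. We now assume that $\sigma_1^2 = \sigma_2^2 = \sigma^2$
but $\sigma^2$ is unknown. The p.d.f. of $X_{ij}$'s is then

$$C(\mu_1,\mu_2,\sigma^2) \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^2\sum_{j=1}^{n_i} x_{ij}^2 + \frac{n_1\mu_1}{\sigma^2}\bar{x}_1 + \frac{n_2\mu_2}{\sigma^2}\bar{x}_2 \right\},$$

which is of the form (6.23) with

$$\theta = \frac{\mu_2 - \mu_1}{(n_1^{-1} + n_2^{-1})\sigma^2}, \qquad \varphi = \left( \frac{n_1\mu_1 + n_2\mu_2}{(n_1+n_2)\sigma^2},\ -\frac{1}{2\sigma^2} \right),$$

$$Y = \bar{X}_2 - \bar{X}_1, \qquad U = \left( n_1\bar{X}_1 + n_2\bar{X}_2,\ \sum_{i=1}^2\sum_{j=1}^{n_i} X_{ij}^2 \right).$$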


For testing $H_0: \theta \le 0$ (i.e., $\mu_1 \ge \mu_2$) versus $H_1: \theta > 0$, we consider $V$ in
Lemma 6.7 to be

$$t(X) = \frac{(\bar{X}_2 - \bar{X}_1)\big/\sqrt{n_1^{-1} + n_2^{-1}}}{\sqrt{[(n_1-1)S_1^2 + (n_2-1)S_2^2]/(n_1+n_2-2)}}. \tag{6.37}$$

When $\theta = 0$, $t(X)$ is independent of $U$ (Basu's theorem) and satisfies
the conditions in Lemma 6.7(i). Moreover, the numerator of $t(X)$ divided by
$\sigma$ is distributed as $N(0,1)$, the quantity $[(n_1-1)S_1^2 + (n_2-1)S_2^2]/\sigma^2$
under the square root in the denominator has the chi-square distribution
$\chi^2_{n_1+n_2-2}$, and the two are independent. Hence $t(X)$ has the
t-distribution $t_{n_1+n_2-2}$ and a UMPU test rejects $H_0$ when $t(X) > t_{n_1+n_2-2,\alpha}$,
where $t_{n_1+n_2-2,\alpha}$ is the $(1-\alpha)$th quantile of the t-distribution $t_{n_1+n_2-2}$.
This is the so-called (one-sided) two-sample t-test.

For testing $H_0: \theta = 0$ (i.e., $\mu_1 = \mu_2$) versus $H_1: \theta \ne 0$, it follows from an
argument similar to that used in the derivation of the (two-sided) one-sample
t-test that a UMPU test rejects $H_0$ when $|t(X)| > t_{n_1+n_2-2,\alpha/2}$ (exercise). This
is the (two-sided) two-sample t-test.

The power function of a two-sample t-test is related to a noncentral 
t-distribution. 
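For concreteness, the pooled two-sample statistic in (6.37) can be computed as in the following Python sketch. The data and the function name `pooled_t` are hypothetical; the critical value $t_{n_1+n_2-2,\alpha}$ would have to come from a table or a statistics library and is not computed here.

```python
import math

def pooled_t(x1, x2):
    """Two-sample t statistic (6.37) under the equal-variance assumption."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)          # (n1 - 1) * S_1^2
    ss2 = sum((v - m2) ** 2 for v in x2)          # (n2 - 1) * S_2^2
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)      # pooled variance estimate
    return (m2 - m1) / math.sqrt(1 / n1 + 1 / n2) / math.sqrt(pooled_var)

x1 = [1.0, 2.0, 3.0]          # hypothetical sample 1
x2 = [2.0, 4.0, 6.0, 8.0]     # hypothetical sample 2
t = pooled_t(x1, x2)
print(t)
```

The statistic is unchanged when a common constant is added to all observations or when all observations are multiplied by a common positive constant, and it changes sign when the roles of the two samples are exchanged.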


Normal linear models 


Consider linear model (3.25) with assumption A1, i.e.,

$$X = (X_1,...,X_n) \text{ is } N_n(Z\beta, \sigma^2 I_n), \tag{6.38}$$

where $\beta$ is a $p$-vector of unknown parameters, $Z$ is the $n \times p$ matrix whose
$i$th row is the vector $Z_i$, $Z_i$'s are the values of a $p$-vector of deterministic
covariates, and $\sigma^2 > 0$ is an unknown parameter. Assume that $n > p$ and
the rank of $Z$ is $r \le p$. Let $l \in \mathcal{R}(Z)$ (the linear space generated by the
rows of $Z$) and $\theta_0$ be a fixed constant. We consider the hypotheses

$$H_0: l^\tau\beta \le \theta_0 \quad \text{versus} \quad H_1: l^\tau\beta > \theta_0 \tag{6.39}$$

or

$$H_0: l^\tau\beta = \theta_0 \quad \text{versus} \quad H_1: l^\tau\beta \ne \theta_0. \tag{6.40}$$


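Since $H = Z(Z^\tau Z)^- Z^\tau$ is a projection matrix of rank $r$, there exists an
$n \times n$ orthogonal matrix $\Gamma$ such that

$$\Gamma = (\Gamma_1, \Gamma_2) \quad \text{and} \quad H\Gamma = (\Gamma_1, 0), \tag{6.41}$$

where $\Gamma_1$ is $n \times r$ and $\Gamma_2$ is $n \times (n-r)$. Let $Y_j = \Gamma_j^\tau X$, $j = 1,2$. Consider the
transformation $(Y_1,Y_2) = \Gamma^\tau X$. Since $\Gamma^\tau\Gamma = I_n$ and $X$ is $N_n(Z\beta,\sigma^2 I_n)$,
$(Y_1,Y_2)$ is $N_n(\Gamma^\tau Z\beta, \sigma^2 I_n)$. It follows from (6.41) that

$$E(Y_2) = E(\Gamma_2^\tau X) = \Gamma_2^\tau Z\beta = \Gamma_2^\tau HZ\beta = 0.$$

Let $\eta = \Gamma_1^\tau Z\beta = E(Y_1)$. Then the p.d.f. of $(Y_1,Y_2)$ is

$$\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ \frac{\eta^\tau Y_1}{\sigma^2} - \frac{\|Y_1\|^2 + \|Y_2\|^2}{2\sigma^2} - \frac{\|\eta\|^2}{2\sigma^2} \right\}. \tag{6.42}$$

Since $l$ in (6.39) or (6.40) is in $\mathcal{R}(Z)$, there exists $\lambda \in \mathcal{R}^n$ such that $l = Z^\tau\lambda$.
Then

$$l^\tau\hat\beta = \lambda^\tau HX = \lambda^\tau H\Gamma\Gamma^\tau X = \lambda^\tau\Gamma_1\Gamma_1^\tau X = \lambda^\tau\Gamma_1 Y_1, \tag{6.43}$$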



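where $\hat\beta$ is the LSE defined by (3.27). By (6.43) and Theorem 3.6(ii),

$$E(l^\tau\hat\beta) = l^\tau\beta = \lambda^\tau\Gamma_1 E(Y_1) = a^\tau\eta,$$

where $a = \Gamma_1^\tau\lambda$. Let $\eta = (\eta_1,...,\eta_r)$ and $a = (a_1,...,a_r)$. Without loss of
generality, we assume that $a_1 \ne 0$. Then the p.d.f. in (6.42) is of the form
(6.23) with

$$\theta = \frac{a^\tau\eta - \theta_0}{a_1\sigma^2}, \qquad \varphi = \left( -\frac{1}{2\sigma^2},\ \frac{\eta_2}{\sigma^2},\ ...,\ \frac{\eta_r}{\sigma^2} \right), \qquad Y = Y_{11},$$

$$U = \left( \|Y_1\|^2 + \|Y_2\|^2 - \frac{2\theta_0 Y_{11}}{a_1},\ Y_{12} - \frac{a_2 Y_{11}}{a_1},\ ...,\ Y_{1r} - \frac{a_r Y_{11}}{a_1} \right),$$

where $Y_{1j}$ is the $j$th component of $Y_1$. By Basu's theorem,

$$t(X) = \frac{\sqrt{n-r}\,(a^\tau Y_1 - \theta_0)}{\|Y_2\|\,\|a\|}$$

is independent of $U$ when $a^\tau\eta = l^\tau\beta = \theta_0$. Note that $\|Y_2\|^2 = SSR$ in
(3.35) and $\|a\|^2 = \lambda^\tau\Gamma_1\Gamma_1^\tau\lambda = \lambda^\tau H\lambda = l^\tau(Z^\tau Z)^- l$. Hence, by (6.43),

$$t(X) = \frac{l^\tau\hat\beta - \theta_0}{\sqrt{l^\tau(Z^\tau Z)^- l}\,\sqrt{SSR/(n-r)}},$$

which has the t-distribution $t_{n-r}$ (Theorem 3.8) when $l^\tau\beta = \theta_0$. Using the same
arguments as in deriving the one-sample or two-sample t-test, we obtain that a
UMPU test for the hypotheses in (6.39) rejects $H_0$ when $t(X) > t_{n-r,\alpha}$, and that a
UMPU test for the hypotheses in (6.40) rejects $H_0$ when $|t(X)| > t_{n-r,\alpha/2}$.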


Testing for independence in the bivariate normal family 


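Suppose that $X_1,...,X_n$ are i.i.d. from a bivariate normal distribution, i.e.,
the p.d.f. of $X = (X_1,...,X_n)$ is

$$\frac{1}{\left(2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}\right)^n} \exp\left\{ -\frac{\|Y_1 - \mu_1\|^2}{2\sigma_1^2(1-\rho^2)} + \frac{\rho(Y_1 - \mu_1)^\tau(Y_2 - \mu_2)}{\sigma_1\sigma_2(1-\rho^2)} - \frac{\|Y_2 - \mu_2\|^2}{2\sigma_2^2(1-\rho^2)} \right\}, \tag{6.44}$$

where $Y_j = (X_{1j},...,X_{nj})$, $X_{ij}$ is the $j$th component of $X_i$, and $Y_j - \mu_j$ means
that $\mu_j$ is subtracted from every component of $Y_j$, $j = 1,2$.

Testing for independence of the two components of $X_1$ (or $Y_1$ and $Y_2$) is
equivalent to testing $H_0: \rho = 0$ versus $H_1: \rho \ne 0$. In some cases, one may
also be interested in the one-sided hypotheses $H_0: \rho \le 0$ versus $H_1: \rho > 0$.
It can be shown (exercise) that the p.d.f. in (6.44) is of the form (6.23) with
$\theta = \dfrac{\rho}{\sigma_1\sigma_2(1-\rho^2)}$ and

$$Y = \sum_{i=1}^n X_{i1}X_{i2}, \qquad U = \left( \sum_{i=1}^n X_{i1}^2,\ \sum_{i=1}^n X_{i2}^2,\ \sum_{i=1}^n X_{i1},\ \sum_{i=1}^n X_{i2} \right).$$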




The hypothesis $\rho \le 0$ is equivalent to $\theta \le 0$. The sample correlation
coefficient is

$$R = \sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) \Bigg/ \left[ \sum_{i=1}^n (X_{i1} - \bar{X}_1)^2 \sum_{i=1}^n (X_{i2} - \bar{X}_2)^2 \right]^{1/2},$$

where $\bar{X}_j$ is the sample mean of $X_{1j},...,X_{nj}$, $j = 1,2$. By Basu's theorem,
$R$ is independent of $U$ when $\rho = 0$. To apply Lemma 6.7, we consider

$$V = \sqrt{n-2}\,R \big/ \sqrt{1 - R^2}. \tag{6.45}$$

It can be shown (exercise) that $R$ is linear in $Y$ and that $V$ has the
t-distribution $t_{n-2}$ when $\rho = 0$. Hence, a UMPU test for $H_0: \rho \le 0$ versus
$H_1: \rho > 0$ rejects $H_0$ when $V > t_{n-2,\alpha}$ and a UMPU test for $H_0: \rho = 0$
versus $H_1: \rho \ne 0$ rejects $H_0$ when $|V| > t_{n-2,\alpha/2}$, where $t_{n-2,\alpha}$ is the
$(1-\alpha)$th quantile of the t-distribution $t_{n-2}$.
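The statistics $R$ and $V$ in (6.45) are straightforward to compute. The following Python sketch is an illustration only (the paired data and the function name `correlation_t` are made up); it also checks that $V$ is unchanged by increasing affine changes of scale in either coordinate.

```python
import math

def correlation_t(pairs):
    """Return the sample correlation R and V = sqrt(n-2) R / sqrt(1 - R^2)."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    s11 = sum((p[0] - m1) ** 2 for p in pairs)
    s22 = sum((p[1] - m2) ** 2 for p in pairs)
    r = s12 / math.sqrt(s11 * s22)
    v = math.sqrt(n - 2) * r / math.sqrt(1 - r * r)
    return r, v

# Hypothetical bivariate observations (X_i1, X_i2)
pairs = [(1.0, 2.1), (2.0, 2.9), (3.0, 3.2), (4.0, 4.8), (5.0, 5.1)]
r, v = correlation_t(pairs)

# Same statistic after rescaling each coordinate with a positive slope
r2, v2 = correlation_t([(2 * a + 1, 3 * b - 4) for a, b in pairs])
print(r, v, v2)
```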


6.3. UMP Invariant Tests 


In the previous section the unbiasedness principle is considered to derive 
an optimal test within the class of unbiased tests when a UMP test does 
not exist. In this section, we study the same problem with unbiasedness 
replaced by invariance under a given group of transformations. The prin- 
ciples of unbiasedness and invariance often complement each other in that 
each is successful in cases where the other is not. 


6.3.1 Invariance and UMPI tests 


The invariance principle considered here is similar to that introduced in 
§2.3.2 (Definition 2.9) and in §4.2. Although a hypothesis testing problem 
can be treated as a particular statistical decision problem (see, e.g., Ex- 
ample 2.20), in the following definition we define invariant tests without 
using any loss function which is a basic element in statistical decision the- 
ory. However, the reader is encouraged to compare Definition 2.9 with the 
following definition. 


Definition 6.5. Let $X$ be a sample from $P \in \mathcal{P}$ and $G$ be a group
(Definition 2.9(i)) of one-to-one transformations of $X$.

(i) We say that the problem of testing $H_0: P \in \mathcal{P}_0$ versus $H_1: P \in \mathcal{P}_1$ is
invariant under $G$ if and only if both $\mathcal{P}_0$ and $\mathcal{P}_1$ are invariant under $G$ in
the sense of Definition 2.9(ii).

(ii) In an invariant testing problem, a test $T(X)$ is said to be invariant
under $G$ if and only if

$$T(g(x)) = T(x) \quad \text{for all } x \text{ and } g. \tag{6.46}$$

(iii) A test of size $\alpha$ is said to be a uniformly most powerful invariant
(UMPI) test if and only if it is UMP within the class of level $\alpha$ tests that
are invariant under $G$.

(iv) A statistic $M(X)$ is said to be maximal invariant under $G$ if and only
if (6.46) holds with $T$ replaced by $M$ and

$$M(x_1) = M(x_2) \quad \text{implies} \quad x_1 = g(x_2) \text{ for some } g \in G. \tag{6.47}$$ ∎


The following result indicates that invariance reduces the data $X$ to a
maximal invariant statistic $M(X)$ whose distribution may depend only on
a functional of $P$ that shrinks $\mathcal{P}$.


Proposition 6.2. Let $M(X)$ be maximal invariant under $G$.

(i) A test $T(X)$ is invariant under $G$ if and only if there is a function $h$ such
that $T(x) = h(M(x))$ for all $x$.

(ii) Suppose that there is a functional $\theta(P)$ on $\mathcal{P}$ satisfying $\theta(g(P)) = \theta(P)$
for all $g \in G$ and $P \in \mathcal{P}$ and

$$\theta(P_1) = \theta(P_2) \quad \text{implies} \quad P_1 = g(P_2) \text{ for some } g \in G$$

(i.e., $\theta(P)$ is "maximal invariant"), where $g(P_X) = P_{g(X)}$ is given in
Definition 2.9(ii). Then the distribution of $M(X)$ depends only on $\theta(P)$.
Proof. (i) If $T(x) = h(M(x))$ for all $x$, then $T(g(x)) = h(M(g(x))) =
h(M(x)) = T(x)$ so that $T$ is invariant. If $T$ is invariant and if $M(x_1) =
M(x_2)$, then $x_1 = g(x_2)$ for some $g$ and $T(x_1) = T(g(x_2)) = T(x_2)$. Hence
$T$ is a function of $M$.

(ii) Suppose that $\theta(P_1) = \theta(P_2)$. Then $P_2 = g(P_1)$ for some $g \in G$ and for
any event $B$ in the range of $M(X)$,

$$P_2(M(X) \in B) = g(P_1)(M(X) \in B) = P_1(M(g(X)) \in B) = P_1(M(X) \in B),$$

where the last equality follows from the invariance of $M$. Hence the
distribution of $M(X)$ depends only on $\theta(P)$. ∎


In applications, maximal invariants $M(X)$ and $\theta = \theta(P)$ are frequently
real-valued. If the hypotheses of interest can be expressed in terms of $\theta$, then
there may exist a test UMP among those depending only on $M(X)$ (e.g.,
when the distribution of $M(X)$ is in a parametric family having monotone
likelihood ratio). Such a test is then a UMPI test.




Example 6.13 (Location-scale families). Suppose that $X$ has the Lebesgue
p.d.f. $f_{i,\mu}(x) = f_i(x_1 - \mu, ..., x_n - \mu)$, where $n \ge 2$, $\mu \in \mathcal{R}$ is unknown, and
$f_i$, $i = 0,1$, are known Lebesgue p.d.f.'s. We consider the problem of testing

$$H_0: X \text{ is from } f_{0,\mu} \quad \text{versus} \quad H_1: X \text{ is from } f_{1,\mu}. \tag{6.48}$$

Consider $G = \{g_c : c \in \mathcal{R}\}$ with $g_c(x) = (x_1 + c, ..., x_n + c)$. Each $g_c \in G$
induces a transformation $g_c(f_{i,\mu}) = f_{i,\mu+c}$ and the problem of testing $H_0$
versus $H_1$ in (6.48) is invariant under $G$.


We now show that a maximal invariant under $G$ is $D(X) = (D_1,...,D_{n-1})
= (X_1 - X_n, ..., X_{n-1} - X_n)$. First, it is easy to see that $D(X)$ is invariant
under $G$. Let $x = (x_1,...,x_n)$ and $y = (y_1,...,y_n)$ be two points in the
range of $X$. Suppose that $x_i - x_n = y_i - y_n$ for $i = 1,...,n-1$. Putting
$c = y_n - x_n$, we have $y_i = x_i + c$ for all $i$. Hence, $D(X)$ is maximal invariant
under $G$.
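The two properties that make $D$ maximal invariant can be checked directly on a toy example. The Python sketch below is only an illustration (the points `x`, `y` and the helper names are assumptions): it verifies invariance, $D(g_c(x)) = D(x)$, and reconstructs the shift $c = y_n - x_n$ linking two points with the same value of $D$.

```python
def max_invariant_D(x):
    """D(x) = (x_1 - x_n, ..., x_{n-1} - x_n)."""
    return [xi - x[-1] for xi in x[:-1]]

def shift(x, c):
    """Group element g_c(x) = (x_1 + c, ..., x_n + c)."""
    return [xi + c for xi in x]

x = [3, 7, 1, 4]          # hypothetical point in the range of X
y = [10, 14, 8, 11]       # another point with the same differences

# Invariance: shifting x does not change D
print(max_invariant_D(shift(x, 5)) == max_invariant_D(x))

# Maximality: D(x) = D(y) implies y = g_c(x) with c = y_n - x_n
c = y[-1] - x[-1]
print(max_invariant_D(x) == max_invariant_D(y), shift(x, c) == y)
```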


By Proposition 1.8, $D$ has the p.d.f. $\int f_i(d_1 + t, ..., d_{n-1} + t, t)\,dt$ under
$H_i$, $i = 0,1$, which does not depend on $\mu$. In fact, in this case Proposition
6.2 applies with $M(X) = D(X)$ and $\theta(f_{i,\mu}) = i$. If we consider tests that
are functions of $D(X)$, then the problem of testing the hypotheses in (6.48)
becomes one of testing a simple hypothesis versus a simple hypothesis. By
Theorem 6.1, the test UMP among functions of $D(X)$, which is then the
UMPI test, rejects $H_0$ in (6.48) when

$$\frac{\int f_1(d_1 + t, ..., d_{n-1} + t, t)\,dt}{\int f_0(d_1 + t, ..., d_{n-1} + t, t)\,dt} = \frac{\int f_1(x_1 + t, ..., x_n + t)\,dt}{\int f_0(x_1 + t, ..., x_n + t)\,dt} > c,$$

where $c$ is determined by the size of the UMPI test.


The previous result can be extended to the case of a location-scale family
where the p.d.f. of $X$ is one of $f_{i,\mu,\sigma}(x) = \sigma^{-n} f_i\!\left(\frac{x_1-\mu}{\sigma}, ..., \frac{x_n-\mu}{\sigma}\right)$, $i = 0,1$,
$f_{i,\mu,\sigma}$ is symmetric about $\mu$, the hypotheses of interest are given by (6.48)
with $f_{i,\mu}$ replaced by $f_{i,\mu,\sigma}$, and $G = \{g_{c,r} : c \in \mathcal{R}, r \ne 0\}$ with $g_{c,r}(x) =
(rx_1 + c, ..., rx_n + c)$. When $n \ge 3$, it can be shown that a maximal invariant
under $G$ is $W(X) = (W_1,...,W_{n-2})$, where $W_i = (X_i - X_n)/(X_{n-1} - X_n)$,
and that the p.d.f. of $W$ does not depend on $(\mu,\sigma)$. A UMPI test can then
be derived (exercise). ∎


The next example considers finding a maximal invariant in a problem 
that is not a location-scale family problem. 


Example 6.14. Let $G$ be the set of $n!$ permutations of the components of
$x \in \mathcal{R}^n$. Then a maximal invariant is the vector of order statistics. This is
because a permutation of the components of $x$ does not change the values
of these components and two $x$'s with the same set of ordered components
can be obtained from each other through a permutation of coordinates.

Suppose that $\mathcal{P}$ contains continuous c.d.f.'s on $\mathcal{R}^n$. Let $G$ be the class of
all transformations of the form $g(x) = (\psi(x_1),...,\psi(x_n))$, where $\psi$ is
continuous and strictly increasing. For $x = (x_1,...,x_n)$, let $R(x) = (R_1,...,R_n)$ be
the vector of ranks (§5.2.2), i.e., $x_i = x_{(R_i)}$, where $x_{(j)}$ is the $j$th smallest
value of $x_i$'s. Clearly, $R(g(x)) = R(x)$ for any $g \in G$. For any $x$ and $y$
in $\mathcal{R}^n$ with $R(x) = R(y)$, define $\psi(t)$ to be linear between $x_{(j)}$ and $x_{(j+1)}$
with $\psi(x_{(j)}) = y_{(j)}$ and $\psi(x_{(j+1)}) = y_{(j+1)}$, $j = 1,...,n-1$,
$\psi(t) = t + (y_{(1)} - x_{(1)})$ for $t \le x_{(1)}$, and $\psi(t) = t + (y_{(n)} - x_{(n)})$
for $t \ge x_{(n)}$. Then $\psi$ is continuous and strictly increasing and
$\psi(x_i) = y_i$, $i = 1,...,n$. This shows that the vector of rank statistics
is maximal invariant. ∎
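The construction of $\psi$ in the second part of the example can be made concrete. The following Python sketch is an illustration under the assumption of no ties (the points `x`, `y` and the helper names `ranks` and `make_psi` are made up): it builds the piecewise-linear $\psi$ described above and checks that it maps each $x_i$ to $y_i$.

```python
def ranks(x):
    """R(x): rank of each component (1 = smallest); assumes no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def make_psi(x, y):
    """Piecewise-linear increasing psi with psi(x_(j)) = y_(j), extended
    with slope 1 below x_(1) and above x_(n)."""
    xs, ys = sorted(x), sorted(y)

    def psi(t):
        if t <= xs[0]:
            return t + (ys[0] - xs[0])
        if t >= xs[-1]:
            return t + (ys[-1] - xs[-1])
        for j in range(len(xs) - 1):
            if xs[j] <= t <= xs[j + 1]:
                w = (t - xs[j]) / (xs[j + 1] - xs[j])
                return ys[j] + w * (ys[j + 1] - ys[j])

    return psi

x = [2.0, -1.0, 5.0, 0.5]      # hypothetical point
y = [10.0, 3.0, 40.0, 4.0]     # another point with the same rank vector
psi = make_psi(x, y)
print(ranks(x), ranks(y), [psi(t) for t in x])
```

Invariance of the ranks can be seen by applying any continuous strictly increasing map, for example $t \mapsto t^3$, to every coordinate.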


When there is a sufficient statistic U(X), it is convenient first to reduce 
the data to U(X) before applying invariance. If there is a test T(U) UMP 
among all invariant tests depending only on U, one would like to conclude 
that T(U) is a UMPI test. Unfortunately, this may not be true in general, 
since it is not clear that for any invariant test based on X there is an 
equivalent invariant test based only on U(X). The following result provides 
a sufficient condition under which it is enough to consider invariant tests 
depending only on U(X). Its proof is omitted and can be found in Lehmann 
(1986, pp. 297-302). 


Proposition 6.3. Let $G$ be a group of transformations on $\mathcal{X}$ (the range of
$X$) and $(G, \mathcal{B}_G, \lambda)$ be a measure space with a $\sigma$-finite $\lambda$. Suppose that the
testing problem under consideration is invariant under $G$, that for any set
$A \in \mathcal{B}_{\mathcal{X}}$, the set of points $(x,g)$ for which $g(x) \in A$ is in $\sigma(\mathcal{B}_{\mathcal{X}} \times \mathcal{B}_G)$, and
that $\lambda(B) = 0$ implies $\lambda(\{h \circ g : h \in B\}) = 0$ for all $g \in G$. Suppose further
that there is a statistic $U(X)$ sufficient for $P \in \mathcal{P}$ and that $U(x_1) = U(x_2)$
implies $U(g(x_1)) = U(g(x_2))$ for all $g \in G$ so that $G$ induces a group $G_U$ of
transformations on the range of $U$ through $g_U(U(x)) = U(g(x))$. Then, for
any test $T(X)$ invariant under $G$, there exists a test based on $U(X)$ that is
invariant under $G$ (and $G_U$) and has the same power function as $T(X)$. ∎

In many problems $g(x) = \psi(x,g)$, where $g$ ranges over a set $G$ in $\mathcal{R}^m$
and $\psi$ is a Borel function on $\mathcal{R}^{n+m}$. Then the measurability condition in
Proposition 6.3 is satisfied by choosing $\mathcal{B}_G$ to be the Borel $\sigma$-field on $G$.
In such cases it is usually not difficult to find a measure $\lambda$ satisfying the
condition in Proposition 6.3.


Example 6.15. Let $X_1,...,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with unknown $\mu \in \mathcal{R}$
and $\sigma^2 > 0$. The problem of testing $H_0: \sigma^2 \ge \sigma_0^2$ versus $H_1: \sigma^2 < \sigma_0^2$
is invariant under $G = \{g_c : c \in \mathcal{R}\}$ with $g_c(x) = (x_1 + c, ..., x_n + c)$. It
can be shown (exercise) that $G$ and the sufficient statistic $U = (\bar{X}, S^2)$
satisfy the conditions in Proposition 6.3 with $G_U = \{h_c : c \in \mathcal{R}\}$ and
$h_c(u_1,u_2) = (u_1 + c, u_2)$, and that $S^2$ is maximal invariant under $G_U$. It
follows from Proposition 6.3, Corollary 6.1, and the fact that $(n-1)S^2/\sigma_0^2$
has the chi-square distribution $\chi^2_{n-1}$ when $\sigma^2 = \sigma_0^2$ that a UMPI test of size
$\alpha$ rejects $H_0$ when $(n-1)S^2/\sigma_0^2 \le \chi^2_{n-1,1-\alpha}$, where $\chi^2_{n-1,\alpha}$ is the $(1-\alpha)$th
quantile of the chi-square distribution $\chi^2_{n-1}$. This test coincides with the
UMPU test given in §6.2.3. ∎


Example 6.16. Let $X_{i1},...,X_{in_i}$, $i = 1,2$, be two independent samples
i.i.d. from $N(\mu_i,\sigma_i^2)$, $i = 1,2$, respectively. The problem of testing $H_0:
\sigma_2^2/\sigma_1^2 \le \Delta_0$ versus $H_1: \sigma_2^2/\sigma_1^2 > \Delta_0$ is invariant under

$$G = \{g_{c_1,c_2,r} : c_i \in \mathcal{R},\ i = 1,2,\ r > 0\}$$

with

$$g_{c_1,c_2,r}(x_1,x_2) = (rx_{11} + c_1, ..., rx_{1n_1} + c_1, rx_{21} + c_2, ..., rx_{2n_2} + c_2).$$

It can be shown (exercise) that the sufficient statistic $U = (\bar{X}_1, \bar{X}_2, S_1^2, S_2^2)$
and $G$ satisfy the conditions in Proposition 6.3 with

$$G_U = \{h_{c_1,c_2,r} : c_i \in \mathcal{R},\ i = 1,2,\ r > 0\}$$

and

$$h_{c_1,c_2,r}(u_1,u_2,u_3,u_4) = (ru_1 + c_1, ru_2 + c_2, r^2u_3, r^2u_4),$$

since multiplying the observations by $r$ multiplies each sample variance by $r^2$.

A maximal invariant under $G_U$ is $S_2^2/S_1^2$. Let $\Delta = \sigma_2^2/\sigma_1^2$. Then $(S_2^2/S_1^2)/\Delta$
has an F-distribution and, therefore, $V = S_2^2/S_1^2$ has a Lebesgue p.d.f. of
the form

$$f_\Delta(v) = C(\Delta)\,v^{(n_2-3)/2}\left[\Delta + (n_2-1)v/(n_1-1)\right]^{-(n_1+n_2-2)/2} I_{(0,\infty)}(v),$$

where $C(\Delta)$ is a known function of $\Delta$. It can be shown (exercise) that the
family $\{f_\Delta : \Delta > 0\}$ has monotone likelihood ratio in $V$ so that a UMPI test
of size $\alpha$ rejects $H_0$ when $V > \Delta_0 F_{n_2-1,n_1-1,\alpha}$, where $F_{a,b,\alpha}$ is the $(1-\alpha)$th
quantile of the F-distribution $F_{a,b}$. Again, this UMPI test coincides with
the UMPU test given in §6.2.3. ∎


The following result shows that, in Examples 6.15 and 6.16, the fact that 
UMPI tests are the same as the UMPU tests is not a simple coincidence. 


Proposition 6.4. Consider a testing problem invariant under $G$. If there
exists a UMPI test of size $\alpha$, then it is unbiased. If there also exists a
UMPU test of size $\alpha$ that is invariant under $G$, then the two tests have the
same power function on $P \in \mathcal{P}_1$. If either the UMPI test or the UMPU test
is unique a.s. $\mathcal{P}$, then the two tests are equal a.s. $\mathcal{P}$.

Proof. We only need to prove that a UMPI test of size $\alpha$ is unbiased. This
follows from the fact that the test $T \equiv \alpha$ is invariant under $G$: a UMPI test
is at least as powerful as $T \equiv \alpha$, so its power is at least $\alpha$ on $\mathcal{P}_1$. ∎




The next example shows an application of invariance in a situation 
where a UMPU test may not exist. 


Example 6.17. Let $X_1,...,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with unknown $\mu$
and $\sigma^2$. Let $\theta = (\mu - u)/\sigma$, where $u$ is a known constant. Consider the
problem of testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$. Note that $H_0$ is the
same as $P(X_1 \le u) \ge p_0$ for a known constant $p_0 = \Phi(-\theta_0)$. Without loss
of generality, we consider the case of $u = 0$.

The problem is invariant under $G = \{g_r : r > 0\}$ with $g_r(x) = rx$. By
Proposition 6.3, we can consider tests that are functions of the sufficient
statistic $(\bar{X}, S^2)$ only. A maximal invariant under $G$ is $t(X) = \sqrt{n}\bar{X}/S$. To
find a UMPI test, it remains to find a test UMP among all tests that are
functions of $t(X)$.

From the discussion in §1.3.1, $t(X)$ has the noncentral t-distribution
$t_{n-1}(\sqrt{n}\theta)$. Let $f_\theta(t)$ be the Lebesgue p.d.f. of $t(X)$, i.e., $f_\theta$ is given by
(1.32) with $n$ replaced by $n - 1$ and $\delta = \sqrt{n}\theta$. It can be shown (exercise)
that the family of p.d.f.'s, $\{f_\theta(t) : \theta \in \mathcal{R}\}$, has monotone likelihood ratio in
$t$. Hence, by Theorem 6.2, a UMPI test of size $\alpha$ rejects $H_0$ when $t(X) > c$,
where $c$ is the $(1-\alpha)$th quantile of $t_{n-1}(\sqrt{n}\theta_0)$.

In some problems, we may have to apply both unbiasedness and invariance
principles. For instance, suppose that in the current problem we would
like to test $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$. The problem is still invariant
under $G$. Following the previous discussion, we only need to consider tests
that are functions of $t(X)$. But a test UMP among functions of $t(X)$ does
not exist in this case. A test UMP among all unbiased tests of level $\alpha$ that
are functions of $t(X)$ rejects $H_0$ when $t(X) < c_1$ or $t(X) > c_2$, where $c_1$
and $c_2$ are determined by

$$\int_{c_1}^{c_2} f_{\theta_0}(t)\,dt = 1-\alpha \quad \text{and} \quad \frac{d}{d\theta}\left[ \int_{c_1}^{c_2} f_\theta(t)\,dt \right]_{\theta=\theta_0} = 0$$

(see Exercise 26). This test is then UMP among all tests that are invariant
and unbiased of level $\alpha$. Whether it is also UMPU without the restriction
to invariant tests is an open problem. ∎


6.3.2 UMPI tests in normal linear models

Consider normal linear model (6.38):

$$X = N_n(Z\beta, \sigma^2 I_n),$$

where $\beta$ is a $p$-vector of unknown parameters, $\sigma^2 > 0$ is unknown, and $Z$ is
a fixed $n \times p$ matrix of rank $r \le p < n$. In §6.2.3, UMPU tests for testing
(6.39) or (6.40) are derived. A frequently encountered problem in practice
is to test

$$H_0: L\beta = 0 \quad \text{versus} \quad H_1: L\beta \ne 0, \tag{6.49}$$

where $L$ is an $s \times p$ matrix of rank $s \le r$ and all rows of $L$ are in $\mathcal{R}(Z)$.
However, a UMPU test for (6.49) does not exist if $s > 1$. We now derive
a UMPI test for testing (6.49). We use without proof the following result
from linear algebra: there exists an orthogonal matrix $\Gamma$ such that (6.49)
is equivalent to

$$H_0: \eta_1 = 0 \quad \text{versus} \quad H_1: \eta_1 \ne 0, \tag{6.50}$$

where $\eta_1$ is the $s$-vector containing the first $s$ components of $\eta$, $\eta$ is the
$r$-vector containing the first $r$ components of $\Gamma Z\beta$, and the last $n - r$
components of $\Gamma Z\beta$ are 0's. Let $Y = \Gamma X$. Then $Y = N_n((\eta, 0), \sigma^2 I_n)$ with the
p.d.f. given by (6.42). Let $Y = (Y_1, Y_2)$, where $Y_1$ is an $r$-vector, and let
$Y_1 = (Y_{11}, Y_{12})$, where $Y_{11}$ is an $s$-vector. Define

$$G = \{g_{\Lambda,c,\gamma} : c \in \mathcal{R}^{r-s},\ \gamma > 0,\ \Lambda \text{ is an } s \times s \text{ orthogonal matrix}\}$$

with

$$g_{\Lambda,c,\gamma}(Y) = \gamma(\Lambda Y_{11},\ Y_{12} + c,\ Y_2).$$

Testing (6.50) is invariant under $G$. By Proposition 6.3, we can restrict our
attention to the sufficient statistic $U = (Y_1, \|Y_2\|^2)$. The statistic

$$M(U) = \|Y_{11}\|^2 / \|Y_2\|^2 \tag{6.51}$$

is invariant under $G_U$, the group of transformations on the range of $U$
defined by $\tilde{g}_{\Lambda,c,\gamma}(U(Y)) = U(g_{\Lambda,c,\gamma}(Y))$. We now show that $M(U)$ is
maximal invariant under $G_U$. Let $l_i \in \mathcal{R}^s$, $l_i \ne 0$, and $t_i \in (0,\infty)$, $i = 1,2$.
If $\|l_1\|^2/t_1^2 = \|l_2\|^2/t_2^2$, then $t_1 = \gamma t_2$ with $\gamma = \|l_1\|/\|l_2\|$. Since $l_1/\|l_1\|$
and $l_2/\|l_2\|$ are two points having the same distance from the origin, there
exists an orthogonal matrix $\Lambda$ such that $l_1/\|l_1\| = \Lambda l_2/\|l_2\|$, i.e., $l_1 = \gamma\Lambda l_2$.
This proves that if $M(u^{(1)}) = M(u^{(2)})$ with $u^{(j)} = (y_{11}^{(j)}, y_{12}^{(j)}, t_j^2)$, then
$y_{11}^{(1)} = \gamma\Lambda y_{11}^{(2)}$ and $t_1 = \gamma t_2$ for some $\gamma > 0$ and orthogonal matrix $\Lambda$ and,
therefore, $u^{(1)} = \tilde{g}_{\Lambda,c,\gamma}(u^{(2)})$ with $c = \gamma^{-1}y_{12}^{(1)} - y_{12}^{(2)}$. Thus, $M(U)$ is
maximal invariant under $G_U$.

It can be shown (exercise) that $W = M(U)(n-r)/s$ has the noncentral
F-distribution $F_{s,n-r}(\theta)$ with $\theta = \|\eta_1\|^2/\sigma^2$ (see §1.3.1). Let $f_\theta(w)$ be the
Lebesgue p.d.f. of $W$, i.e., $f_\theta$ is given by (1.33) with $n_1 = s$, $n_2 = n - r$,
and $\delta = \theta$. Note that under $H_0$, $\theta = 0$ and $f_\theta$ reduces to the p.d.f. of the
central F-distribution $F_{s,n-r}$ (Table 1.2, page 20). Also, it can be shown
(exercise) that the ratio $f_{\theta_1}(w)/f_0(w)$ is an increasing function of $w$ for any
given $\theta_1 > 0$. By Theorem 6.1, a UMPI test of size $\alpha$ for testing $H_0: \theta = 0$
versus $H_1: \theta = \theta_1$ rejects $H_0$ when $W > F_{s,n-r,\alpha}$, where $F_{s,n-r,\alpha}$ is the
$(1-\alpha)$th quantile of the F-distribution $F_{s,n-r}$. Since this test does not
depend on $\theta_1$, by Lemma 6.1, it is also a UMPI test of size $\alpha$ for testing
$H_0: \theta = 0$ versus $H_1: \theta > 0$, which is equivalent to testing (6.50).


In applications it is not convenient to carry out the test by finding
explicitly the orthogonal matrix $\Gamma$. Hence, we now express the statistic $W$
in terms of $X$. Since $Y = \Gamma X$ and $E(Y) = \Gamma E(X) = \Gamma Z\beta$,

$$\|Y_1 - \eta\|^2 + \|Y_2\|^2 = \|X - Z\beta\|^2$$

and, therefore,

$$\min_\eta \|Y_1 - \eta\|^2 + \|Y_2\|^2 = \min_\beta \|X - Z\beta\|^2,$$

which is the same as

$$\|Y_2\|^2 = \|X - Z\hat\beta\|^2 = SSR,$$

where $\hat\beta$ is the LSE defined by (3.27). Similarly,

$$\|Y_{11}\|^2 + \|Y_2\|^2 = \min_{\beta: L\beta = 0} \|X - Z\beta\|^2.$$

If we define $\hat\beta_{H_0}$ to be a solution of

$$\|X - Z\hat\beta_{H_0}\|^2 = \min_{\beta: L\beta = 0} \|X - Z\beta\|^2,$$

which is called the LSE of $\beta$ under $H_0$ or the LSE of $\beta$ subject to $L\beta = 0$,
then

$$W = \frac{\left( \|X - Z\hat\beta_{H_0}\|^2 - \|X - Z\hat\beta\|^2 \right)/s}{\|X - Z\hat\beta\|^2/(n-r)}. \tag{6.52}$$

Thus, the UMPI test for (6.49) can be used without finding $\Gamma$.

When $s = 1$, the UMPI test derived here is the same as the UMPU test
for (6.40) given in §6.2.3.
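To illustrate the case $s = 1$, the following Python sketch considers simple linear regression $X_i = \beta_0 + \beta_1 z_i + \varepsilon_i$ (so $p = r = 2$) with $H_0: \beta_1 = 0$, and checks numerically that $W$ in (6.52) equals the square of the t-statistic of §6.2.3 with $l = (0,1)$ and $\theta_0 = 0$. The model choice, the data, and the function name `simple_regression_tests` are assumptions made for illustration.

```python
import math

def simple_regression_tests(z, x):
    """Return (W, t) for testing beta_1 = 0 in x_i = beta_0 + beta_1 z_i + error."""
    n = len(x)
    zbar, xbar = sum(z) / n, sum(x) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    szx = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
    b1 = szx / szz                     # LSE of the slope
    b0 = xbar - b1 * zbar              # LSE of the intercept
    # ||X - Z beta_hat||^2 (full model) and ||X - Z beta_hat_H0||^2 (slope = 0)
    ssr_full = sum((xi - b0 - b1 * zi) ** 2 for zi, xi in zip(z, x))
    ssr_h0 = sum((xi - xbar) ** 2 for xi in x)
    s, r = 1, 2
    W = ((ssr_h0 - ssr_full) / s) / (ssr_full / (n - r))     # statistic (6.52)
    # t-statistic: here l'(Z'Z)^{-1} l = 1 / szz for l = (0, 1)
    t = b1 / math.sqrt((ssr_full / (n - r)) / szz)
    return W, t

z = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical covariate values
x = [1.2, 1.9, 3.2, 3.8, 5.1]          # hypothetical responses
W, t = simple_regression_tests(z, x)
print(W, t * t)
```

Since $W = t(X)^2$ and $F_{1,n-r,\alpha} = t_{n-r,\alpha/2}^2$, rejecting when $W > F_{1,n-r,\alpha}$ is the same as rejecting when $|t(X)| > t_{n-r,\alpha/2}$.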


Example 6.18. Consider the one-way ANOVA model in Example 3.13:

$$X_{ij} = N(\mu_i, \sigma^2), \qquad j = 1,...,n_i,\ i = 1,...,m,$$

and $X_{ij}$'s are independent. A common testing problem in applications is
the test for homogeneity of means, i.e.,

$$H_0: \mu_1 = \cdots = \mu_m \quad \text{versus} \quad H_1: \mu_i \ne \mu_k \text{ for some } i \ne k. \tag{6.53}$$

One can easily find a matrix $L$ for which (6.53) is equivalent to (6.49).
But it is not necessary to find such a matrix in order to compute the
statistic $W$ that defines the UMPI test. Note that the LSE of $(\mu_1,...,\mu_m)$
is $(\bar{X}_{1\cdot},...,\bar{X}_{m\cdot})$, where $\bar{X}_{i\cdot}$ is the sample mean based on $X_{i1},...,X_{in_i}$, and
the LSE under $H_0$ is simply $\bar{X}$, the sample mean based on all $X_{ij}$'s. Thus,

$$SSR = \|X - Z\hat\beta\|^2 = \sum_{i=1}^m \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2,$$

$$SST = \|X - Z\hat\beta_{H_0}\|^2 = \sum_{i=1}^m \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2,$$

and

$$SSA = SST - SSR = \sum_{i=1}^m n_i(\bar{X}_{i\cdot} - \bar{X})^2.$$

Then

$$W = \frac{SSA/(m-1)}{SSR/(n-m)},$$

where $n = \sum_{i=1}^m n_i$. The name ANOVA comes from the fact that the UMPI
test is carried out by comparing two sources of variation: the variation
within each group of observations (measured by SSR) and the variation
among $m$ groups (measured by SSA), and that SSA + SSR = SST is the
total variation in the data set.

In this case, the distribution of $W$ can also be derived using Cochran's
theorem (Theorem 1.5). See Exercise 75. ∎
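The quantities in this example can be computed directly from grouped data. The Python sketch below is only an illustration (the data and the function name `one_way_anova` are made up); it evaluates SSR, SSA, SST, and $W$ and checks the decomposition SSA + SSR = SST.

```python
def one_way_anova(groups):
    """Return (SSR, SSA, SST, W) for a one-way layout given as a list of groups."""
    m = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n              # overall mean
    means = [sum(g) / len(g) for g in groups]            # group means
    ssr = sum((v - mi) ** 2 for g, mi in zip(groups, means) for v in g)
    sst = sum((v - grand) ** 2 for g in groups for v in g)
    ssa = sum(len(g) * (mi - grand) ** 2 for g, mi in zip(groups, means))
    W = (ssa / (m - 1)) / (ssr / (n - m))
    return ssr, ssa, sst, W

# Hypothetical unbalanced data with m = 3 groups
groups = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
ssr, ssa, sst, W = one_way_anova(groups)
print(ssr, ssa, sst, W)
```

The UMPI test would compare $W$ with the quantile $F_{m-1,n-m,\alpha}$, which is not computed in this sketch.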


Example 6.19. Consider the two-way balanced ANOVA model in Example 3.14:

$$X_{ijk} = N(\mu_{ij}, \sigma^2), \qquad i = 1,...,a,\ j = 1,...,b,\ k = 1,...,c,$$

where $\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$, $\sum_{i=1}^a \alpha_i = \sum_{j=1}^b \beta_j = \sum_{i=1}^a \gamma_{ij} = \sum_{j=1}^b \gamma_{ij} =
0$, and $X_{ijk}$'s are independent. Typically the following hypotheses are of
interest:

$$H_0: \alpha_i = 0 \text{ for all } i \quad \text{versus} \quad H_1: \alpha_i \ne 0 \text{ for some } i, \tag{6.54}$$

$$H_0: \beta_j = 0 \text{ for all } j \quad \text{versus} \quad H_1: \beta_j \ne 0 \text{ for some } j, \tag{6.55}$$

and

$$H_0: \gamma_{ij} = 0 \text{ for all } i,j \quad \text{versus} \quad H_1: \gamma_{ij} \ne 0 \text{ for some } i,j. \tag{6.56}$$

In applications, $\alpha_i$'s are effects of a factor A (a variable taking finitely many
values), $\beta_j$'s are effects of a factor B, and $\gamma_{ij}$'s are effects of the interaction
of factors A and B. Hence, testing hypotheses in (6.54), (6.55), and (6.56)
are the same as testing effects of factor A, of factor B, and of the interaction
between A and B, respectively.


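The LSE's of $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ are given by (Example 3.14) $\hat\mu = \bar{X}_{\cdot\cdot\cdot}$,
$\hat\alpha_i = \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot}$, $\hat\beta_j = \bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot}$, $\hat\gamma_{ij} = \bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot}$, where a dot
is used to denote averaging over the indicated subscript. Let

$$SSR = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^c (X_{ijk} - \bar{X}_{ij\cdot})^2,$$

$$SSA = bc \sum_{i=1}^a (\bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot\cdot\cdot})^2,$$

$$SSB = ac \sum_{j=1}^b (\bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot})^2,$$

and

$$SSC = c \sum_{i=1}^a \sum_{j=1}^b (\bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot})^2.$$

Then, one can show (exercise) that for testing (6.54), (6.55), and (6.56),
the statistics $W$ in (6.52) (for the UMPI tests) are, respectively,

$$\frac{SSA/(a-1)}{SSR/[(c-1)ab]}, \qquad \frac{SSB/(b-1)}{SSR/[(c-1)ab]}, \qquad \text{and} \qquad \frac{SSC/[(a-1)(b-1)]}{SSR/[(c-1)ab]}.$$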
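The sums of squares for the balanced two-way layout can be computed as in the following Python sketch, which assumes the usual definitions SSB $= ac\sum_j(\bar{X}_{\cdot j\cdot} - \bar{X}_{\cdot\cdot\cdot})^2$ and SSC $= c\sum_{i,j}(\bar{X}_{ij\cdot} - \bar{X}_{i\cdot\cdot} - \bar{X}_{\cdot j\cdot} + \bar{X}_{\cdot\cdot\cdot})^2$. The data and the function name `two_way_anova` are hypothetical. As a check, the four sums of squares should add up to the total variation about the grand mean.

```python
def two_way_anova(x):
    """x[i][j][k]: balanced a-by-b layout with c replicates per cell."""
    a, b, c = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / c for j in range(b)] for i in range(a)]    # cell means
    row = [sum(cell[i]) / b for i in range(a)]                          # factor A means
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]    # factor B means
    grand = sum(row) / a                                                # grand mean
    ssr = sum((v - cell[i][j]) ** 2
              for i in range(a) for j in range(b) for v in x[i][j])
    ssa = b * c * sum((row[i] - grand) ** 2 for i in range(a))
    ssb = a * c * sum((col[j] - grand) ** 2 for j in range(b))
    ssc = c * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                  for i in range(a) for j in range(b))
    df_r = (c - 1) * a * b
    return {
        "SSR": ssr, "SSA": ssa, "SSB": ssb, "SSC": ssc,
        "W_A": (ssa / (a - 1)) / (ssr / df_r),
        "W_B": (ssb / (b - 1)) / (ssr / df_r),
        "W_C": (ssc / ((a - 1) * (b - 1))) / (ssr / df_r),
    }

# Hypothetical data with a = 2, b = 3, c = 2
data = [[[1.0, 2.0], [3.0, 5.0], [2.0, 2.5]],
        [[4.0, 4.5], [6.0, 7.0], [3.0, 5.0]]]
out = two_way_anova(data)

values = [v for rowblock in data for cellvals in rowblock for v in cellvals]
mean_all = sum(values) / len(values)
total_ss = sum((v - mean_all) ** 2 for v in values)
print(out, total_ss)
```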


We end this section with a discussion of testing for random effects in
the following balanced one-way random effects model (Example 3.17):

$$X_{ij} = \mu + A_i + e_{ij}, \qquad i = 1,...,a,\ j = 1,...,b, \tag{6.57}$$

where $\mu$ is an unknown parameter, $A_i$'s are i.i.d. random effects from
$N(0,\sigma_a^2)$, $e_{ij}$'s are i.i.d. measurement errors from $N(0,\sigma^2)$, and $A_i$'s and
$e_{ij}$'s are independent. Consider the problem of testing

$$H_0: \sigma_a^2/\sigma^2 \le \Delta_0 \quad \text{versus} \quad H_1: \sigma_a^2/\sigma^2 > \Delta_0 \tag{6.58}$$

for a given $\Delta_0$. When $\Delta_0$ is small, hypothesis $H_0$ in (6.58) means that the
random effects are negligible relative to the measurement variation.


Let $(Y_{i1},...,Y_{ib}) = \Gamma(X_{i1},...,X_{ib})$, where $\Gamma$ is a $b \times b$ orthogonal matrix
whose elements in the first row are all equal to $1/\sqrt{b}$. Then

$$Y_{i1} = \sqrt{b}\,\bar{X}_{i\cdot} = \sqrt{b}(\mu + A_i + \bar{e}_{i\cdot}), \qquad i = 1,...,a,$$

are i.i.d. from $N(\sqrt{b}\mu, \sigma^2 + b\sigma_a^2)$, $Y_{ij}$, $i = 1,...,a$, $j = 2,...,b$, are i.i.d. from
$N(0,\sigma^2)$, and $Y_{ij}$'s are independent. The reason why $E(Y_{ij}) = 0$ when
$j > 1$ is that row $j$ of $\Gamma$ is orthogonal to the first row of $\Gamma$.

Let $\Lambda$ be an $a \times a$ orthogonal matrix whose elements in the first row are
all equal to $1/\sqrt{a}$ and $(U_{11},...,U_{a1}) = \Lambda(Y_{11},...,Y_{a1})$. Then $U_{11} = \sqrt{a}\,\bar{Y}_{\cdot 1}$ is
$N(\sqrt{ab}\mu, \sigma^2 + b\sigma_a^2)$, $U_{i1}$, $i = 2,...,a$, are from $N(0, \sigma^2 + b\sigma_a^2)$, and $U_{i1}$'s are
independent. Let $U_{ij} = Y_{ij}$ for $j = 2,...,b$, $i = 1,...,a$.

The problem of testing (6.58) is invariant under the group of
transformations that transform $U_{11}$ to $rU_{11} + c$ and $U_{ij}$ to $rU_{ij}$, $(i,j) \ne (1,1)$, where
$r > 0$ and $c \in \mathcal{R}$. It can be shown (exercise) that the maximal invariant
under this group of transformations is SSA/SSR, where

$$SSA = \sum_{i=2}^a U_{i1}^2 \quad \text{and} \quad SSR = \sum_{i=1}^a \sum_{j=2}^b U_{ij}^2.$$

Note that $H_0$ in (6.58) is equivalent to $(\sigma^2 + b\sigma_a^2)/\sigma^2 \le 1 + b\Delta_0$. Also,
$SSA/(\sigma^2 + b\sigma_a^2)$ has the chi-square distribution $\chi^2_{a-1}$ and $SSR/\sigma^2$ has the
chi-square distribution $\chi^2_{a(b-1)}$. Hence, the p.d.f. of the statistic

$$W = \frac{1}{1 + b\Delta_0} \cdot \frac{SSA/(a-1)}{SSR/[a(b-1)]}$$

is in a parametric family (indexed by the parameter $(\sigma^2 + b\sigma_a^2)/\sigma^2$) with
monotone likelihood ratio in $W$. Thus, a UMPI test of size $\alpha$ for testing
(6.58) rejects $H_0$ when $W > F_{a-1,a(b-1),\alpha}$, where $F_{a-1,a(b-1),\alpha}$ is the
$(1-\alpha)$th quantile of the F-distribution $F_{a-1,a(b-1)}$.


It remains to express $W$ in terms of $X_{ij}$'s. Note that
\[ SSR = \sum_{i=1}^{a}\sum_{j=2}^{b} Y_{ij}^2 = \sum_{i=1}^{a}\left(\sum_{j=1}^{b} X_{ij}^2 - b\bar{X}_{i\cdot}^2\right) = \sum_{i=1}^{a}\sum_{j=1}^{b} (X_{ij} - \bar{X}_{i\cdot})^2 \]
and
\[ SSA = \sum_{i=1}^{a} U_{i1}^2 - U_{11}^2 = \sum_{i=1}^{a} Y_{i1}^2 - a\bar{Y}_{\cdot 1}^2 = b\sum_{i=1}^{a} (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2. \]

The $SSR$ and $SSA$ derived here are the same as those in Example 6.18
when $n_i = b$ for all $i$ and $m = a$. It can also be seen that if $\Delta_0 = 0$,
then testing (6.58) is equivalent to testing $H_0 : \sigma_a^2 = 0$ versus $H_1 : \sigma_a^2 > 0$
and the derived UMPI test is exactly the same as that in Example 6.18,
although the testing problems are different in these two cases.
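
As a numerical illustration, the following Python sketch computes $SSA$, $SSR$ and the statistic $W$ directly from a balanced $a \times b$ array of observations $X_{ij}$, using the closed forms in terms of $\bar{X}_{i\cdot}$ and $\bar{X}_{\cdot\cdot}$. The function name `umpi_statistic` and the small data set are hypothetical; the F-quantile needed for the rejection rule is not computed here and would have to come from a table or a statistics library.

```python
from statistics import mean

def umpi_statistic(x, delta0):
    """Return (SSA, SSR, W) for a balanced one-way layout x[i][j], i < a, j < b.

    W = [SSA/(a-1)] / [SSR/(a(b-1))] / (1 + b*delta0); H0 is rejected when W
    exceeds the (1-alpha)th quantile of the F-distribution F_{a-1, a(b-1)}.
    """
    a, b = len(x), len(x[0])
    row_means = [mean(row) for row in x]
    grand_mean = mean(row_means)  # balanced design: mean of the row means
    ssa = b * sum((m - grand_mean) ** 2 for m in row_means)
    ssr = sum((v - m) ** 2 for row, m in zip(x, row_means) for v in row)
    w = (ssa / (a - 1)) / (ssr / (a * (b - 1))) / (1 + b * delta0)
    return ssa, ssr, w

# Hypothetical data with a = 2 groups and b = 3 observations per group.
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
ssa, ssr, w = umpi_statistic(data, delta0=0.0)
```

With $\Delta_0 = 0$ the value of $W$ is the usual one-way ANOVA F-ratio; a positive $\Delta_0$ only rescales it by $1/(1 + b\Delta_0)$.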


Extensions to balanced two-way random effects models can be found in 
Lehmann (1986, §7.12). 




6.4 Tests in Parametric Models 


A UMP, UMPU, or UMPI test often does not exist in a particular prob- 
lem. In the rest of this chapter, we study some methods for constructing 
tests that have intuitive appeal and frequently coincide with optimal tests 
(UMP or UMPU tests) when optimal tests do exist. We consider tests in 
parametric models in this section, whereas tests in nonparametric models 
are studied in §6.5. 


When the hypothesis $H_0$ is not simple, it is often difficult or even impossible
to obtain a test that has exactly a given size $\alpha$, since it is hard to
find a population $P$ that maximizes the power function of the test over all
$P \in \mathcal{P}_0$. In such cases a common approach is to find tests having
asymptotic significance level $\alpha$ (Definition 2.13). This involves finding the limit
of the power of a test at $P \in \mathcal{P}_0$, which is studied in this section and §6.5.

Throughout this section, we assume that a sample $X$ is from $P \in \mathcal{P} =
\{P_\theta : \theta \in \Theta\}$, $\Theta \subset \mathcal{R}^k$, $f_\theta = dP_\theta/d\nu$ exists w.r.t. a $\sigma$-finite measure $\nu$ for all $\theta$,
and the testing problem is
\[ H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1, \tag{6.59} \]
where $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$.


6.4.1 Likelihood ratio tests 


When both $H_0$ and $H_1$ are simple (i.e., both $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$ are
single-point sets), Theorem 6.1 applies and a UMP test rejects $H_0$ when
\[ \frac{f_{\theta_1}(X)}{f_{\theta_0}(X)} > c_0 \tag{6.60} \]
for some $c_0 > 0$. When $c_0 > 1$, (6.60) is equivalent to (exercise)
\[ \frac{f_{\theta_0}(X)}{\max\{f_{\theta_0}(X), f_{\theta_1}(X)\}} < c \tag{6.61} \]
for some $c \in (0,1)$. The following definition is a natural extension of this
idea.


Definition 6.6. Let $\ell(\theta) = f_\theta(X)$ be the likelihood function. For testing
(6.59), a likelihood ratio (LR) test is any test that rejects $H_0$ if and only if
$\lambda(X) < c$, where $c \in [0,1]$ and $\lambda(X)$ is the likelihood ratio defined by
\[ \lambda(X) = \frac{\sup_{\theta \in \Theta_0} \ell(\theta)}{\sup_{\theta \in \Theta} \ell(\theta)}. \]




If $\lambda(X)$ is well defined, then $\lambda(X) \le 1$. The rationale behind LR tests is
that when $H_0$ is true, $\lambda(X)$ tends to be close to 1, whereas when $H_1$ is true,
$\lambda(X)$ tends to be away from 1. If there is a sufficient statistic, then $\lambda(X)$
depends only on the sufficient statistic. LR tests are as widely applicable
as MLE's in §4.4 and, in fact, they are closely related to MLE's. If $\hat\theta$ is an
MLE of $\theta$ and $\hat\theta_0$ is an MLE of $\theta$ subject to $\theta \in \Theta_0$ (i.e., $\Theta_0$ is treated as
the parameter space), then
\[ \lambda(X) = \ell(\hat\theta_0)/\ell(\hat\theta). \]
For a given $\alpha \in (0,1)$, if there exists a $c_\alpha \in [0,1]$ such that
\[ \sup_{\theta \in \Theta_0} P_\theta(\lambda(X) < c_\alpha) = \alpha, \tag{6.62} \]
then an LR test of size $\alpha$ can be obtained. Even when the c.d.f. of $\lambda(X)$ is
continuous or randomized LR tests are introduced, it is still possible that
a $c_\alpha$ satisfying (6.62) does not exist.

When a UMP or UMPU test exists, an LR test is often the same as this
optimal test. For a real-valued $\theta$, we have the following result.


Proposition 6.5. Suppose that $X$ has the p.d.f. given by (6.10) w.r.t. a
$\sigma$-finite measure $\nu$, where $\eta$ is a strictly increasing and differentiable function
of $\theta$.
(i) For testing $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$, there is an LR test whose
rejection region is the same as that of the UMP test $T_*$ given by (6.11).
(ii) For testing the hypotheses in (6.12), there is an LR test whose rejection
region is the same as that of the UMP test $T_*$ given by (6.15).
(iii) For testing the hypotheses in (6.13) or (6.14), there is an LR test
whose rejection region is equivalent to $Y(X) < c_1$ or $Y(X) > c_2$ for some
constants $c_1$ and $c_2$.

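Proof. (i) Let $\hat\theta$ be the MLE of $\theta$. Note that $\ell(\theta)$ is increasing when $\theta \le \hat\theta$
and decreasing when $\theta > \hat\theta$. Thus,
\[ \lambda(X) = \begin{cases} 1 & \hat\theta \le \theta_0 \\ \ell(\theta_0)/\ell(\hat\theta) & \hat\theta > \theta_0. \end{cases} \]
Then $\lambda(X) < c$ is the same as $\hat\theta > \theta_0$ and $\ell(\theta_0)/\ell(\hat\theta) < c$. From the
discussion in §4.4.2, $\hat\theta$ is a strictly increasing function of $Y$. It can be
shown that $\log \ell(\hat\theta) - \log \ell(\theta_0)$ is strictly increasing in $Y$ when $\hat\theta > \theta_0$ and
strictly decreasing in $Y$ when $\hat\theta < \theta_0$ (exercise). Hence, for any $d \in \mathcal{R}$,
$\hat\theta > \theta_0$ and $\ell(\theta_0)/\ell(\hat\theta) < c$ is equivalent to $Y > d$ for some $c \in (0,1)$.

(ii) The proof is similar to that in (i). Note that
\[ \lambda(X) = \begin{cases} 1 & \hat\theta < \theta_1 \text{ or } \hat\theta > \theta_2 \\ \max\{\ell(\theta_1), \ell(\theta_2)\}/\ell(\hat\theta) & \theta_1 \le \hat\theta \le \theta_2. \end{cases} \]
Hence $\lambda(X) < c$ is equivalent to $c_1 < Y < c_2$.
(iii) The proof for (iii) is left as an exercise. ∎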


Proposition 6.5 can be applied to problems concerning one-parameter 
exponential families such as the binomial, Poisson, negative binomial, and 
normal (with one parameter known) families. The following example shows 
that the same result holds in a situation where Proposition 6.5 is not ap- 
plicable. 


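Example 6.20. Consider the testing problem $H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$
based on i.i.d. $X_1,...,X_n$ from the uniform distribution $U(0,\theta)$. We now
show that the UMP test with rejection region $X_{(n)} > \theta_0$ or $X_{(n)} \le \theta_0\alpha^{1/n}$
given in Exercise 19(c) is an LR test. Note that $\ell(\theta) = \theta^{-n} I_{(X_{(n)},\infty)}(\theta)$.
Hence
\[ \lambda(X) = \begin{cases} (X_{(n)}/\theta_0)^n & X_{(n)} \le \theta_0 \\ 0 & X_{(n)} > \theta_0 \end{cases} \]
and $\lambda(X) < c$ is equivalent to $X_{(n)} > \theta_0$ or $X_{(n)}/\theta_0 < c^{1/n}$. Taking $c = \alpha$
ensures that the LR test has size $\alpha$. ∎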
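
The likelihood ratio of Example 6.20 is simple enough to evaluate by hand, and the following Python sketch does so. The function names and the sample values are hypothetical and serve only as an illustration. The last function records why $c = \alpha$ gives size $\alpha$: under $H_0$, $P(X_{(n)} \le t) = (t/\theta_0)^n$ for $0 \le t \le \theta_0$.

```python
def lr_uniform(sample, theta0):
    """Likelihood ratio lambda(X) for H0: theta = theta0 under U(0, theta)."""
    n = len(sample)
    x_max = max(sample)  # X_(n), the largest order statistic
    return 0.0 if x_max > theta0 else (x_max / theta0) ** n

def lr_test_rejects(sample, theta0, alpha):
    """Reject H0 when lambda(X) < c with c = alpha."""
    return lr_uniform(sample, theta0) < alpha

def size_under_h0(n, alpha):
    """P(lambda(X) < alpha) under H0, i.e. P(X_(n)/theta0 < alpha**(1/n))."""
    cutoff = alpha ** (1.0 / n)
    return cutoff ** n  # c.d.f. of X_(n)/theta0 evaluated at the cutoff
```

For instance, `lr_uniform([0.2, 0.5, 0.9], 1.0)` equals $0.9^3$, which is far above $\alpha = 0.05$, so $H_0 : \theta = 1$ is not rejected for that sample.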


More examples of this kind can be found in §6.6. The next example
considers multivariate $\theta$.


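Example 6.21. Consider normal linear model (6.38) and the hypotheses
in (6.49). The likelihood function in this problem is
\[ \ell(\theta) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\|X - Z\beta\|^2\right\}, \]
where $\theta = (\beta, \sigma^2)$. Let $\hat\beta$ be the LSE defined by (3.27). Since $\|X - Z\beta\|^2 \ge
\|X - Z\hat\beta\|^2$ for any $\beta$,
\[ \ell(\theta) \le \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\|X - Z\hat\beta\|^2\right\}. \]
Treating the right-hand side of the previous expression as a function of $\sigma^2$,
it is easy to show that it has a maximum at $\sigma^2 = \hat\sigma^2 = \|X - Z\hat\beta\|^2/n$ and,
therefore,
\[ \sup_{\theta \in \Theta} \ell(\theta) = (2\pi\hat\sigma^2)^{-n/2} e^{-n/2}. \]
Similarly, let $\hat\beta_{H_0}$ be the LSE under $H_0$ and $\hat\sigma^2_{H_0} = \|X - Z\hat\beta_{H_0}\|^2/n$. Then
\[ \sup_{\theta \in \Theta_0} \ell(\theta) = (2\pi\hat\sigma^2_{H_0})^{-n/2} e^{-n/2}. \]
Thus,
\[ \lambda(X) = (\hat\sigma^2/\hat\sigma^2_{H_0})^{n/2} = \left(\frac{\|X - Z\hat\beta\|^2}{\|X - Z\hat\beta_{H_0}\|^2}\right)^{n/2} = \left(\frac{sW}{n-r} + 1\right)^{-n/2}, \]
where $W$ is given in (6.52). This shows that LR tests are the same as the
UMPI tests derived in §6.3.2.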

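The one-sample or two-sample two-sided t-tests derived in §6.2.3 are
special cases of LR tests. For a one-sample problem, we define $\beta = \mu$ and
$Z = J_n$, the $n$-vector of ones. Note that $\hat\beta = \bar{X}$, $\hat\sigma^2 = (n-1)S^2/n$, $\hat\beta_{H_0} = 0$
($H_0 : \beta = 0$), and $\hat\sigma^2_{H_0} = \|X\|^2/n = (n-1)S^2/n + \bar{X}^2$. Hence
\[ \lambda(X) = \left[1 + \frac{n\bar{X}^2}{(n-1)S^2}\right]^{-n/2} = \left[1 + \frac{[t(X)]^2}{n-1}\right]^{-n/2}, \]
where $t(X) = \sqrt{n}\bar{X}/S$ has the t-distribution $t_{n-1}$ under $H_0$. Thus, $\lambda(X) <
c$ is equivalent to $|t(X)| > c_0$, which is the rejection region of a one-sample
two-sided t-test.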
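
The identity between the likelihood ratio and the t-statistic in the one-sample problem can be checked numerically. The Python sketch below computes $\lambda(X)$ once from the two variance estimates and once from $t(X)$; the function name `lr_and_t` and the data are hypothetical.

```python
import math
from statistics import mean, variance

def lr_and_t(x):
    """Return (lambda, t, lambda_from_t) for H0: mu = 0 in the N(mu, sigma^2) model."""
    n = len(x)
    xbar = mean(x)
    s2 = variance(x)                        # sample variance S^2 with divisor n - 1
    sigma2_hat = (n - 1) * s2 / n           # unrestricted MLE of sigma^2
    sigma2_h0 = sum(v * v for v in x) / n   # MLE of sigma^2 under H0: ||X||^2 / n
    lam = (sigma2_hat / sigma2_h0) ** (n / 2)
    t = math.sqrt(n) * xbar / math.sqrt(s2)
    lam_from_t = (1 + t * t / (n - 1)) ** (-n / 2)
    return lam, t, lam_from_t

lam, t, lam_from_t = lr_and_t([1.0, 2.0, 4.0, 5.0])  # hypothetical sample
```

Since $\lambda(X)$ is a strictly decreasing function of $|t(X)|$, small values of $\lambda(X)$ correspond exactly to large values of $|t(X)|$.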


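For a two-sample problem, we let $n = n_1 + n_2$, $\beta = (\mu_1, \mu_2)$, and
\[ Z = \begin{pmatrix} J_{n_1} & 0 \\ 0 & J_{n_2} \end{pmatrix}. \]
Testing $H_0 : \mu_1 = \mu_2$ versus $H_1 : \mu_1 \neq \mu_2$ is the same as testing (6.49) with
$L = (1 \ \ {-1})$. Since $\hat\beta_{H_0} = \bar{X}$ and $\hat\beta = (\bar{X}_1, \bar{X}_2)$, where $\bar{X}_1$ and $\bar{X}_2$ are
the sample means based on $X_1,...,X_{n_1}$ and $X_{n_1+1},...,X_n$, respectively, we
have
\[ n\hat\sigma^2 = \sum_{i=1}^{n_1} (X_i - \bar{X}_1)^2 + \sum_{i=n_1+1}^{n} (X_i - \bar{X}_2)^2 = (n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 \]
and
\[ n\hat\sigma^2_{H_0} = (n-1)S^2 = n^{-1}n_1n_2(\bar{X}_1 - \bar{X}_2)^2 + (n_1 - 1)S_1^2 + (n_2 - 1)S_2^2. \]
Therefore, $\lambda(X) < c$ is equivalent to $|t(X)| > c_0$, where $t(X)$ is given by
(6.37), and LR tests are the same as the two-sample two-sided t-tests in
§6.2.3. ∎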


6.4.2 Asymptotic tests based on likelihoods 


As we can see from Proposition 6.5 and the previous examples, an LR test
is often equivalent to a test based on a statistic $Y(X)$ whose distribution
under $H_0$ can be used to determine the rejection region of the LR test
with size $\alpha$. When this technique fails, it is difficult or even impossible to
find an LR test with size $\alpha$, even if the c.d.f. of $\lambda(X)$ is continuous. The
following result shows that in the i.i.d. case we can obtain the asymptotic
distribution (under $H_0$) of the likelihood ratio $\lambda(X)$ so that an LR test
having asymptotic significance level $\alpha$ can be obtained. Assume that $\Theta_0$ is
determined by
\[ H_0 : \theta = g(\vartheta), \tag{6.63} \]
where $\vartheta$ is a $(k-r)$-vector of unknown parameters and $g$ is a continuously
differentiable function from $\mathcal{R}^{k-r}$ to $\mathcal{R}^k$ with a full rank $\partial g(\vartheta)/\partial\vartheta$. For
example, if $\Theta = \mathcal{R}^2$ and $\Theta_0 = \{(\theta_1, \theta_2) \in \Theta : \theta_1 = 0\}$, then $\vartheta = \theta_2$,
$g_1(\vartheta) = 0$, and $g_2(\vartheta) = \vartheta$.


Theorem 6.5. Assume the conditions in Theorem 4.16. Suppose that $H_0$
is determined by (6.63). Under $H_0$, $-2\log\lambda_n \to_d \chi_r^2$, where $\lambda_n = \lambda(X)$
and $\chi_r^2$ is a random variable having the chi-square distribution $\chi_r^2$. Consequently,
the LR test with rejection region $\lambda_n < e^{-\chi^2_{r,\alpha}/2}$ has asymptotic
significance level $\alpha$, where $\chi^2_{r,\alpha}$ is the $(1-\alpha)$th quantile of the chi-square
distribution $\chi_r^2$.

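Proof. Without loss of generality, we assume that there exist an MLE $\hat\theta$
and an MLE $\hat\vartheta$ under $H_0$ such that
\[ \lambda_n = \frac{\sup_{\theta \in \Theta_0} \ell(\theta)}{\sup_{\theta \in \Theta} \ell(\theta)} = \frac{\ell(g(\hat\vartheta))}{\ell(\hat\theta)}. \]
Following the proof of Theorem 4.17 in §4.5.2, we can obtain that
\[ \sqrt{n}\, I_1(\theta)(\hat\theta - \theta) = n^{-1/2} s_n(\theta) + o_p(1), \]
where $s_n(\theta) = \partial \log \ell(\theta)/\partial\theta$ and $I_1(\theta)$ is the Fisher information about $\theta$
contained in $X_1$, and that
\[ 2[\log \ell(\hat\theta) - \log \ell(\theta)] = n(\hat\theta - \theta)^\tau I_1(\theta)(\hat\theta - \theta) + o_p(1). \]
Then
\[ 2[\log \ell(\hat\theta) - \log \ell(\theta)] = n^{-1}[s_n(\theta)]^\tau [I_1(\theta)]^{-1} s_n(\theta) + o_p(1). \]
Similarly, under $H_0$,
\[ 2[\log \ell(g(\hat\vartheta)) - \log \ell(g(\vartheta))] = n^{-1}[\tilde{s}_n(\vartheta)]^\tau [\tilde{I}_1(\vartheta)]^{-1} \tilde{s}_n(\vartheta) + o_p(1), \]
where $\tilde{s}_n(\vartheta) = \partial \log \ell(g(\vartheta))/\partial\vartheta = D(\vartheta) s_n(g(\vartheta))$, $D(\vartheta) = \partial g(\vartheta)/\partial\vartheta$, and
$\tilde{I}_1(\vartheta)$ is the Fisher information about $\vartheta$ (under $H_0$) contained in $X_1$. Combining
these results, we obtain that
\[ -2\log\lambda_n = 2[\log \ell(\hat\theta) - \log \ell(g(\hat\vartheta))] = n^{-1}[s_n(g(\vartheta))]^\tau B(\vartheta) s_n(g(\vartheta)) + o_p(1) \]
under $H_0$, where
\[ B(\vartheta) = [I_1(g(\vartheta))]^{-1} - [D(\vartheta)]^\tau [\tilde{I}_1(\vartheta)]^{-1} D(\vartheta). \]
By the CLT, $n^{-1/2}[I_1(\theta)]^{-1/2} s_n(\theta) \to_d Z$, where $Z = N_k(0, I_k)$. Then, it
follows from Theorem 1.10(iii) that, under $H_0$,
\[ -2\log\lambda_n \to_d Z^\tau [I_1(g(\vartheta))]^{1/2} B(\vartheta) [I_1(g(\vartheta))]^{1/2} Z. \]
Let $D = D(\vartheta)$, $B = B(\vartheta)$, $A = I_1(g(\vartheta))$, and $C = \tilde{I}_1(\vartheta)$. Then
\[ \begin{aligned}
(A^{1/2}BA^{1/2})^2 &= A^{1/2}BABA^{1/2} \\
&= A^{1/2}(A^{-1} - D^\tau C^{-1}D)A(A^{-1} - D^\tau C^{-1}D)A^{1/2} \\
&= (I_k - A^{1/2}D^\tau C^{-1}DA^{1/2})(I_k - A^{1/2}D^\tau C^{-1}DA^{1/2}) \\
&= I_k - 2A^{1/2}D^\tau C^{-1}DA^{1/2} + A^{1/2}D^\tau C^{-1}DAD^\tau C^{-1}DA^{1/2} \\
&= I_k - A^{1/2}D^\tau C^{-1}DA^{1/2} \\
&= A^{1/2}BA^{1/2},
\end{aligned} \]
where the fourth equality follows from the fact that $C = DAD^\tau$. This
shows that $A^{1/2}BA^{1/2}$ is a projection matrix. The rank of $A^{1/2}BA^{1/2}$ is
\[ \begin{aligned}
\mathrm{tr}(A^{1/2}BA^{1/2}) &= \mathrm{tr}(I_k - D^\tau C^{-1}DA) \\
&= k - \mathrm{tr}(C^{-1}DAD^\tau) \\
&= k - \mathrm{tr}(C^{-1}C) \\
&= k - (k - r) \\
&= r.
\end{aligned} \]
Thus, by Exercise 51 in §1.6, $Z^\tau [I_1(g(\vartheta))]^{1/2} B(\vartheta) [I_1(g(\vartheta))]^{1/2} Z \sim \chi_r^2$. ∎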


As an example, Theorem 6.5 can be applied to testing problems in
Example 4.33 where the exact rejection region of the LR test of size $\alpha$ is
difficult to obtain but the likelihood ratio $\lambda_n$ can be calculated numerically.

Tests whose rejection regions are constructed using asymptotic theory
(so that these tests have asymptotic significance level $\alpha$) are called asymptotic
tests, which are useful when a test of exact size $\alpha$ is difficult to find.
There are two popular asymptotic tests based on likelihoods that are asymptotically
equivalent to LR tests. Note that the hypothesis in (6.63) is equivalent
to a set of $r \le k$ equations:
\[ H_0 : R(\theta) = 0, \tag{6.64} \]
where $R(\theta)$ is a continuously differentiable function from $\mathcal{R}^k$ to $\mathcal{R}^r$. Wald
(1943) introduced a test that rejects $H_0$ when the value of
\[ W_n = [R(\hat\theta)]^\tau \{[C(\hat\theta)]^\tau [I_n(\hat\theta)]^{-1} C(\hat\theta)\}^{-1} R(\hat\theta) \]
is large, where $C(\theta) = \partial R(\theta)/\partial\theta$, $I_n(\theta)$ is the Fisher information matrix
based on $X_1,...,X_n$, and $\hat\theta$ is an MLE or RLE of $\theta$. For testing $H_0 : \theta = \theta_0$
with a known $\theta_0$, $R(\theta) = \theta - \theta_0$ and $W_n$ simplifies to
\[ W_n = (\hat\theta - \theta_0)^\tau I_n(\hat\theta)(\hat\theta - \theta_0). \]
Rao (1947) introduced a score test that rejects $H_0$ when the value of
\[ R_n = [s_n(\tilde\theta)]^\tau [I_n(\tilde\theta)]^{-1} s_n(\tilde\theta) \]
is large, where $s_n(\theta) = \partial \log \ell(\theta)/\partial\theta$ is the score function and $\tilde\theta$ is an MLE
or RLE of $\theta$ under $H_0$ in (6.64).


Theorem 6.6. Assume the conditions in Theorem 4.16.
(i) Under $H_0$ given by (6.64), $W_n \to_d \chi_r^2$ and, therefore, the test that rejects $H_0$
if and only if $W_n > \chi^2_{r,\alpha}$ has asymptotic significance level $\alpha$, where $\chi^2_{r,\alpha}$ is
the $(1-\alpha)$th quantile of the chi-square distribution $\chi_r^2$.
(ii) The result in (i) still holds if $W_n$ is replaced by $R_n$.

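Proof. (i) Using Theorems 1.12 and 4.17,
\[ \sqrt{n}[R(\hat\theta) - R(\theta)] \to_d N_r\left(0, [C(\theta)]^\tau [I_1(\theta)]^{-1} C(\theta)\right), \]
where $I_1(\theta)$ is the Fisher information about $\theta$ contained in $X_1$. Under $H_0$,
$R(\theta) = 0$ and, therefore,
\[ n[R(\hat\theta)]^\tau \{[C(\theta)]^\tau [I_1(\theta)]^{-1} C(\theta)\}^{-1} R(\hat\theta) \to_d \chi_r^2 \]
(Theorem 1.10). Then the result follows from Slutsky's theorem (Theorem
1.11) and the fact that $\hat\theta \to_p \theta$ and $I_1(\theta)$ and $C(\theta)$ are continuous at $\theta$.
(ii) From the Lagrange multiplier, $\tilde\theta$ satisfies
\[ s_n(\tilde\theta) + C(\tilde\theta)\lambda_n = 0 \quad \text{and} \quad R(\tilde\theta) = 0. \]
Using Taylor's expansion, one can show (exercise) that under $H_0$,
\[ [C(\theta)]^\tau (\tilde\theta - \theta) = o_p(n^{-1/2}) \tag{6.65} \]
and
\[ s_n(\theta) - I_n(\theta)(\tilde\theta - \theta) + C(\theta)\lambda_n = o_p(n^{1/2}), \tag{6.66} \]
where $I_n(\theta) = nI_1(\theta)$. Multiplying $[C(\theta)]^\tau [I_n(\theta)]^{-1}$ to the left-hand side of
(6.66) and using (6.65), we obtain that
\[ [C(\theta)]^\tau [I_n(\theta)]^{-1} C(\theta)\lambda_n = -[C(\theta)]^\tau [I_n(\theta)]^{-1} s_n(\theta) + o_p(n^{-1/2}), \tag{6.67} \]
which implies
\[ \lambda_n^\tau [C(\theta)]^\tau [I_n(\theta)]^{-1} C(\theta)\lambda_n \to_d \chi_r^2 \tag{6.68} \]
(exercise). Then the result follows from (6.68) and the fact that $C(\tilde\theta)\lambda_n =
-s_n(\tilde\theta)$, $I_n(\theta) = nI_1(\theta)$, and $I_1(\theta)$ is continuous at $\theta$. ∎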




Thus, Wald's tests, Rao's score tests, and LR tests are asymptotically
equivalent. Note that Wald's test requires computing $\hat\theta$, not $\tilde\theta = g(\hat\vartheta)$,
whereas Rao's score test requires computing $\tilde\theta$, not $\hat\theta$. On the other hand,
an LR test requires computing both $\hat\theta$ and $\tilde\theta$ (or solving two maximization
problems). Hence, one may choose one of these tests that is easy to compute
in a particular application.
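
To make the comparison concrete, the following Python sketch evaluates the three statistics in a simple special case chosen here for illustration (it is not an example from the text): i.i.d. Bernoulli observations with success probability $p$ and $H_0 : p = p_0$, so that $k = r = 1$, $I_n(p) = n/[p(1-p)]$, $\hat\theta = \hat{p}$ is the sample proportion and $\tilde\theta = p_0$. The function name `bernoulli_tests` is hypothetical, and the critical value uses the fact that $\chi^2_{1,\alpha}$ is the square of the $(1-\alpha/2)$th standard normal quantile.

```python
import math
from statistics import NormalDist

def bernoulli_tests(successes, n, p0, alpha=0.05):
    """LR, Wald and score statistics for H0: p = p0 (assumes 0 < successes < n)."""
    p_hat = successes / n
    # -2 log(lambda_n): needs both the unrestricted MLE p_hat and the null value p0
    lr = 2 * (successes * math.log(p_hat / p0)
              + (n - successes) * math.log((1 - p_hat) / (1 - p0)))
    # Wald: Fisher information evaluated at the unrestricted MLE
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
    # Rao score: Fisher information evaluated under H0
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-square(1) quantile
    return {"lr": lr, "wald": wald, "score": score, "critical": critical}

result = bernoulli_tests(successes=60, n=100, p0=0.5)  # hypothetical data
```

In this illustration the three statistics differ only in where the information is evaluated, and for moderate $n$ they are close to one another, in line with their asymptotic equivalence.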

The results in Theorems 6.5 and 6.6 can be extended to non-i.i.d. sit- 
uations (e.g., the GLM in §4.4.2). We state without proof the following 
result. 


Theorem 6.7. Assume the conditions in Theorem 4.18. Consider the
problem of testing $H_0$ in (6.64) (or equivalently, (6.63)) with $\theta = (\beta, \phi)$.
Then the results in Theorems 6.5 and 6.6 still hold. ∎


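Example 6.22. Consider the GLM (4.55)-(4.58) with $t_i$'s in a fixed interval
$(t_0, t_\infty)$, $0 < t_0 \le t_\infty < \infty$. Then the Fisher information matrix
\[ I_n(\theta) = \begin{pmatrix} \phi^{-1}M_n(\beta) & 0 \\ 0 & \tilde{I}_n(\beta, \phi) \end{pmatrix}, \]
where $M_n(\beta)$ is given by (4.60) and $\tilde{I}_n(\beta, \phi)$ is the Fisher information about
$\phi$.

Consider the problem of testing $H_0 : \beta = \beta_0$ versus $H_1 : \beta \neq \beta_0$, where
$\beta_0$ is a fixed vector. Then $R(\beta, \phi) = \beta - \beta_0$. Let $(\hat\beta, \hat\phi)$ be the MLE (or
RLE) of $(\beta, \phi)$. Then, Wald's test is based on
\[ W_n = \hat\phi^{-1}(\hat\beta - \beta_0)^\tau M_n(\hat\beta)(\hat\beta - \beta_0) \]
and Rao's score test is based on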


Ryn = [Sn (Go)]" [Mn (Bo) * Sn (Go), 


where $s_n(\beta)$ is given by (4.65) and $\tilde\phi$ is a solution of $\partial \log \ell(\beta_0, \phi)/\partial\phi = 0$.
It follows from Theorem 4.18 that both $W_n$ and $R_n$ are asymptotically
distributed as $\chi_p^2$ under $H_0$. By Slutsky's theorem, we may replace $\hat\phi$ or $\tilde\phi$
by any consistent estimator of $\phi$. ∎


Wald's tests, Rao's score tests, and LR tests are typically consistent according
to Definition 2.13(iii). They are also Chernoff-consistent (Definition
2.13(iv)) if $\alpha$ is chosen to be $\alpha_n \to 0$ and $\chi^2_{r,\alpha_n} = o(n)$ as $n \to \infty$ (exercise).
Other asymptotic optimality properties of these tests are discussed in Wald
(1943); see also Serfling (1980, Chapter 10).




6.4.3 $\chi^2$-tests


A test that is related to the asymptotic tests described in §6.4.2 is the
so-called $\chi^2$-test for testing cell probabilities in a multinomial distribution.
Consider a sequence of $n$ independent trials with $k$ possible outcomes
for each trial. Let $p_j > 0$ be the cell probability of occurrence of
the $j$th outcome in any given trial and $X_j$ be the number of occurrences
of the $j$th outcome in $n$ trials. Then $X = (X_1,...,X_k)$ has the multinomial
distribution (Example 2.7) with the parameter $p = (p_1,...,p_k)$. Let
$\xi_i = (0,...,0,1,0,...,0)$, where the single nonzero component 1 is located in
the $j$th position if the $i$th trial yields the $j$th outcome. Then $\xi_1,...,\xi_n$ are
i.i.d. and $X/n = \bar\xi = \sum_{i=1}^{n} \xi_i/n$. By the CLT,
\[ Z_n(p) = \sqrt{n}\left(\frac{X}{n} - p\right) = \sqrt{n}(\bar\xi - p) \to_d N_k(0, \Sigma), \tag{6.69} \]
where $\Sigma = \mathrm{Var}(X/\sqrt{n})$ is a symmetric $k \times k$ matrix whose $i$th diagonal
element is $p_i(1 - p_i)$ and $(i,j)$th off-diagonal element is $-p_ip_j$.
Consider the problem of testing
\[ H_0 : p = p_0 \quad \text{versus} \quad H_1 : p \neq p_0, \tag{6.70} \]
where $p_0 = (p_{01},...,p_{0k})$ is a known vector of cell probabilities. A popular
test for (6.70) is based on the following $\chi^2$-statistic:
\[ \chi^2 = \sum_{j=1}^{k} \frac{(X_j - np_{0j})^2}{np_{0j}} = \|D(p_0)Z_n(p_0)\|^2, \tag{6.71} \]
where $Z_n(p)$ is given by (6.69) and $D(c)$ with $c = (c_1,...,c_k)$ is the $k \times k$
diagonal matrix whose $j$th diagonal element is $c_j^{-1/2}$. Another popular test
is based on the following modified $\chi^2$-statistic:
\[ \tilde\chi^2 = \sum_{j=1}^{k} \frac{(X_j - np_{0j})^2}{X_j} = \|D(X/n)Z_n(p_0)\|^2. \tag{6.72} \]
Note that $X/n$ is an unbiased estimator of $p$.


Theorem 6.8. Let $\phi = (\sqrt{p_1},...,\sqrt{p_k})$ and $\Lambda$ be a $k \times k$ projection matrix.
(i) If $\Lambda\phi = a\phi$, then
\[ [Z_n(p)]^\tau D(p)\Lambda D(p)Z_n(p) \to_d \chi_r^2, \]
where $\chi_r^2$ has the chi-square distribution $\chi_r^2$ with $r = \mathrm{tr}(\Lambda) - a$.
(ii) The same result holds if $D(p)$ in (i) is replaced by $D(X/n)$.




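Proof. (i) Let $D = D(p)$, $Z_n = Z_n(p)$, and $Z = N_k(0, I_k)$. From (6.69)
and Theorem 1.10,
\[ Z_n^\tau D\Lambda DZ_n \to_d Z^\tau AZ \quad \text{with} \quad A = \Sigma^{1/2}D\Lambda D\Sigma^{1/2}. \]
From Exercise 51 in §1.6, the result in (i) follows if we can show that $A^2 = A$
(i.e., $A$ is a projection matrix) and $r = \mathrm{tr}(A)$. Since $\Lambda$ is a projection matrix
and $\Lambda\phi = a\phi$, $a$ must be either 0 or 1. Note that $D\Sigma D = I_k - \phi\phi^\tau$. Then
\[ \begin{aligned}
A^3 &= \Sigma^{1/2}D\Lambda D\Sigma D\Lambda D\Sigma D\Lambda D\Sigma^{1/2} \\
&= \Sigma^{1/2}D(\Lambda - a\phi\phi^\tau)(\Lambda - a\phi\phi^\tau)\Lambda D\Sigma^{1/2} \\
&= \Sigma^{1/2}D(\Lambda - 2a\phi\phi^\tau + a^2\phi\phi^\tau)\Lambda D\Sigma^{1/2} \\
&= \Sigma^{1/2}D(\Lambda - a\phi\phi^\tau)\Lambda D\Sigma^{1/2} \\
&= \Sigma^{1/2}D\Lambda D\Sigma D\Lambda D\Sigma^{1/2} \\
&= A^2,
\end{aligned} \]
which implies that the eigenvalues of $A$ must be 0 or 1. Therefore, $A^2 = A$.
Also,
\[ \mathrm{tr}(A) = \mathrm{tr}[\Lambda(D\Sigma D)] = \mathrm{tr}(\Lambda - a\phi\phi^\tau) = \mathrm{tr}(\Lambda) - a. \]
(ii) The result in (ii) follows from the result in (i) and $X/n \to_p p$. ∎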


Note that the $\chi^2$-statistic in (6.71) and the modified $\chi^2$-statistic in (6.72)
are special cases of the statistics in Theorem 6.8(i) and (ii), respectively,
with $\Lambda = I_k$ satisfying $\Lambda\phi = \phi$. Hence, a test of asymptotic significance
level $\alpha$ for testing (6.70) rejects $H_0$ when $\chi^2 > \chi^2_{k-1,\alpha}$ (or $\tilde\chi^2 > \chi^2_{k-1,\alpha}$),
where $\chi^2_{k-1,\alpha}$ is the $(1-\alpha)$th quantile of $\chi^2_{k-1}$. These tests are called
(asymptotic) $\chi^2$-tests.
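
The two statistics (6.71) and (6.72) are straightforward to compute. The Python sketch below does so for a hypothetical vector of counts; the function names are illustrative. To stay within the standard library, the tail probability of the limiting chi-square distribution is evaluated only for an even number of degrees of freedom, where $P(\chi^2_{2m} > x) = e^{-x/2}\sum_{i=0}^{m-1} (x/2)^i/i!$; for odd degrees of freedom a statistics library or a table would be needed.

```python
import math

def chi2_sf_even(x, df):
    """P(chi-square with df degrees of freedom > x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

def chi2_statistics(counts, p0):
    """Return the chi-square statistic (6.71) and the modified statistic (6.72)."""
    n = sum(counts)
    pearson = sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, p0))
    modified = sum((x - n * p) ** 2 / x for x, p in zip(counts, p0))  # needs x > 0
    return pearson, modified

# Hypothetical counts with k = 3 cells, so the limiting distribution has 2 d.f.
pearson, modified = chi2_statistics([30, 50, 20], [0.25, 0.5, 0.25])
p_value = chi2_sf_even(pearson, df=2)
```

Rejecting $H_0$ when the tail probability is below $\alpha$ is the same as rejecting when the statistic exceeds $\chi^2_{k-1,\alpha}$.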


Example 6.23 (Goodness of fit tests). Let $Y_1,...,Y_n$ be i.i.d. from $F$.
Consider the problem of testing
\[ H_0 : F = F_0 \quad \text{versus} \quad H_1 : F \neq F_0, \tag{6.73} \]
where $F_0$ is a known c.d.f. For instance, $F_0 = N(0,1)$. One way to test
(6.73) is to partition the range of $Y_1$ into $k$ disjoint events $A_1,...,A_k$ and
test (6.70) with $p_j = P_F(A_j)$ and $p_{0j} = P_{F_0}(A_j)$, $j = 1,...,k$. Let $X_j$ be
the number of $Y_i$'s in $A_j$, $j = 1,...,k$. Based on $X_j$'s, the $\chi^2$-tests discussed
previously can be applied to this problem and they are called goodness of
fit tests. ∎
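
A minimal Python sketch of this construction with $F_0 = N(0,1)$ follows. The choice of cells is an assumption made here for illustration: $A_j = \{y : (j-1)/k \le F_0(y) < j/k\}$, so that $p_{0j} = 1/k$ for every $j$. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # F_0 = N(0, 1)

def cell_counts(y, k):
    """Counts X_j of observations in the k equiprobable cells defined through F_0."""
    counts = [0] * k
    for value in y:
        j = min(int(STANDARD_NORMAL.cdf(value) * k), k - 1)
        counts[j] += 1
    return counts

def goodness_of_fit_chi2(y, k):
    """Chi-square statistic (6.71) with p_0j = 1/k for all cells."""
    n = len(y)
    counts = cell_counts(y, k)
    expected = n / k
    return sum((x - expected) ** 2 / expected for x in counts), counts

# An artificial sample placed at N(0,1) quantiles, so every cell holds 10 points.
sample = [STANDARD_NORMAL.inv_cdf((i + 0.5) / 50) for i in range(50)]
chi2, counts = goodness_of_fit_chi2(sample, k=5)
```

The resulting statistic is compared with $\chi^2_{k-1,\alpha}$. Equiprobable cells are only one possible partition, and the power of the test depends on the partition chosen.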


In the goodness of fit tests discussed in Example 6.23, $F_0$ in $H_0$ is known
so that $p_{0j}$'s can be computed. In some cases, we need to test the following
hypotheses that are slightly different from those in (6.73):
\[ H_0 : F = F_\theta \quad \text{versus} \quad H_1 : F \neq F_\theta, \tag{6.74} \]
where $\theta$ is an unknown parameter in $\Theta \subset \mathcal{R}^s$. For example, $F_\theta = N(\mu, \sigma^2)$,
$\theta = (\mu, \sigma^2)$. If we still try to test (6.70) with $p_j = P_{F_\theta}(A_j)$, $j = 1,...,k$, the
result in Example 6.23 is not applicable since $p$ is unknown under $H_0$. A
generalized $\chi^2$-test for (6.74) can be obtained using the following result. Let
$p(\theta) = (p_1(\theta),...,p_k(\theta))$ be a $k$-vector of known functions of $\theta \in \Theta \subset \mathcal{R}^s$,
where $s < k$. Consider the testing problem
\[ H_0 : p = p(\theta) \quad \text{versus} \quad H_1 : p \neq p(\theta). \tag{6.75} \]
Note that (6.70) is the special case of (6.75) with $s = 0$, i.e., $\theta$ is known.
Let $\hat\theta$ be an MLE of $\theta$ under $H_0$. Then, by Theorem 6.5, the LR test that
rejects $H_0$ when $-2\log\lambda_n > \chi^2_{k-s-1,\alpha}$ has asymptotic significance level $\alpha$,
where $\chi^2_{k-s-1,\alpha}$ is the $(1-\alpha)$th quantile of $\chi^2_{k-s-1}$ and
\[ \lambda_n = \prod_{j=1}^{k} \left[p_j(\hat\theta)/(X_j/n)\right]^{X_j}. \]


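Using the fact that $p_j(\hat\theta)/(X_j/n) \to_p 1$ under $H_0$ and
\[ \log(1 + x) = x - x^2/2 + o(|x|^2) \quad \text{as } |x| \to 0, \]
we obtain that
\[ \begin{aligned}
-2\log\lambda_n &= -2\sum_{j=1}^{k} X_j \log\left(1 + \frac{p_j(\hat\theta)}{X_j/n} - 1\right) \\
&= -2\sum_{j=1}^{k} X_j\left(\frac{p_j(\hat\theta)}{X_j/n} - 1\right) + \sum_{j=1}^{k} X_j\left(\frac{p_j(\hat\theta)}{X_j/n} - 1\right)^2 + o_p(1) \\
&= \sum_{j=1}^{k} \frac{[X_j - np_j(\hat\theta)]^2}{X_j} + o_p(1) \\
&= \sum_{j=1}^{k} \frac{[X_j - np_j(\hat\theta)]^2}{np_j(\hat\theta)} + o_p(1),
\end{aligned} \]
where the third equality follows from $\sum_{j=1}^{k} p_j(\hat\theta) = \sum_{j=1}^{k} X_j/n = 1$. Define
the generalized $\chi^2$-statistics $\chi^2$ and $\tilde\chi^2$ to be the $\chi^2$ and $\tilde\chi^2$ in (6.71)
and (6.72), respectively, with $p_{0j}$'s replaced by $p_j(\hat\theta)$'s. We then have the
following result.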


Theorem 6.9. Under $H_0$ given by (6.75), the generalized $\chi^2$-statistics
converge in distribution to $\chi^2_{k-s-1}$. The $\chi^2$-test with rejection region $\chi^2 >
\chi^2_{k-s-1,\alpha}$ (or $\tilde\chi^2 > \chi^2_{k-s-1,\alpha}$) has asymptotic significance level $\alpha$, where
$\chi^2_{k-s-1,\alpha}$ is the $(1-\alpha)$th quantile of $\chi^2_{k-s-1}$. ∎




Theorem 6.9 can be applied to derive a goodness of fit test for hypotheses
(6.74). However, one has to formulate (6.75) and compute an MLE of $\theta$
under $H_0 : p = p(\theta)$, which is different from an MLE under $H_0 : F = F_\theta$
unless (6.74) and (6.75) are the same; see Moore and Spruill (1975). The
next example is the main application of Theorem 6.9.


Example 6.24 ($r \times c$ contingency tables). The following $r \times c$ contingency
table is a natural extension of the $2 \times 2$ contingency table considered in
Example 6.12:
\[ \begin{array}{c|cccc|c}
 & A_1 & A_2 & \cdots & A_c & \text{Total} \\ \hline
B_1 & X_{11} & X_{12} & \cdots & X_{1c} & n_1 \\
B_2 & X_{21} & X_{22} & \cdots & X_{2c} & n_2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
B_r & X_{r1} & X_{r2} & \cdots & X_{rc} & n_r \\ \hline
\text{Total} & m_1 & m_2 & \cdots & m_c & n
\end{array} \]
where $A_j$'s are disjoint events with $A_1 \cup \cdots \cup A_c = \Omega$ (the sample space
of a random experiment), $B_i$'s are disjoint events with $B_1 \cup \cdots \cup B_r = \Omega$,
and $X_{ij}$ is the observed frequency of the outcomes in $A_j \cap B_i$. Similar to
the case of the $2 \times 2$ contingency table discussed in Example 6.12, there
are two important applications in this problem. We first consider testing
independence of $\{A_j : j = 1,...,c\}$ and $\{B_i : i = 1,...,r\}$ with hypotheses
\[ H_0 : p_{ij} = p_{i\cdot}p_{\cdot j} \text{ for all } i, j \quad \text{versus} \quad H_1 : p_{ij} \neq p_{i\cdot}p_{\cdot j} \text{ for some } i, j, \]
where $p_{ij} = P(A_j \cap B_i) = E(X_{ij})/n$, $p_{i\cdot} = P(B_i)$, and $p_{\cdot j} = P(A_j)$,
$i = 1,...,r$, $j = 1,...,c$. In this case, $X = (X_{ij}, i = 1,...,r, j = 1,...,c)$
has the multinomial distribution with parameters $p_{ij}$, $i = 1,...,r$, $j =
1,...,c$. Under $H_0$, MLE's of $p_{i\cdot}$ and $p_{\cdot j}$ are $\bar{X}_{i\cdot} = n_i/n$ and $\bar{X}_{\cdot j} = m_j/n$,
respectively, $i = 1,...,r$, $j = 1,...,c$ (exercise). By Theorem 6.9, the $\chi^2$-test
rejects $H_0$ when $\chi^2 > \chi^2_{(r-1)(c-1),\alpha}$, where
\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(X_{ij} - n\bar{X}_{i\cdot}\bar{X}_{\cdot j})^2}{n\bar{X}_{i\cdot}\bar{X}_{\cdot j}} \tag{6.76} \]
and $\chi^2_{(r-1)(c-1),\alpha}$ is the $(1-\alpha)$th quantile of the chi-square distribution
$\chi^2_{(r-1)(c-1)}$ (exercise). One can also obtain the modified $\chi^2$-test by replacing
$n\bar{X}_{i\cdot}\bar{X}_{\cdot j}$ by $X_{ij}$ in the denominator of each term of the sum in (6.76).

Next, suppose that $(X_{1j},...,X_{rj})$, $j = 1,...,c$, are $c$ independent random
vectors having the multinomial distributions with parameters $(p_{1j},...,p_{rj})$,
$j = 1,...,c$, respectively. Consider the problem of testing whether $c$ multinomial
distributions are the same, i.e.,
\[ H_0 : p_{ij} = p_{i1} \text{ for all } i, j \quad \text{versus} \quad H_1 : p_{ij} \neq p_{i1} \text{ for some } i, j. \]
It turns out that the rejection region of the $\chi^2$-test given in Theorem 6.9 is
still $\chi^2 > \chi^2_{(r-1)(c-1),\alpha}$ with $\chi^2$ given by (6.76) (exercise).

One can also obtain the LR test in this problem. When $r = c = 2$, the
LR test is equivalent to Fisher's exact test given in Example 6.12, which is
a UMPU test. When $r > 2$ or $c > 2$, however, a UMPU test does not exist
in this problem. ∎
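
The statistic (6.76) only requires the row totals $n_i$ and column totals $m_j$, as the following Python sketch shows. The function name `contingency_chi2` and the table entries are hypothetical. For the $2 \times 3$ table used here the limiting distribution has $(r-1)(c-1) = 2$ degrees of freedom, for which the quantile has the closed form $\chi^2_{2,\alpha} = -2\log\alpha$.

```python
import math

def contingency_chi2(table):
    """Return the chi-square statistic (6.76) and its degrees of freedom (r-1)(c-1)."""
    r, c = len(table), len(table[0])
    row_totals = [sum(row) for row in table]                              # n_i
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(c)]   # m_j
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_totals[i] * col_totals[j] / n  # n * Xbar_i. * Xbar_.j
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, (r - 1) * (c - 1)

table = [[10, 20, 30], [20, 20, 20]]       # hypothetical 2 x 3 table of counts
chi2, df = contingency_chi2(table)
critical = -2 * math.log(0.05)             # chi-square(2) quantile at alpha = 0.05
reject = chi2 > critical
```

The same computation applies to the problem of testing whether $c$ multinomial distributions are the same, since the rejection region has the same form.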


6.4.4 Bayes tests 


An LR test actually compares $\sup_{\theta \in \Theta_0} \ell(\theta)$ with $\sup_{\theta \in \Theta_1} \ell(\theta)$ for testing
(6.59). Instead of comparing two maximum values, one may compare two
averages such as $\hat\pi_j = \int_{\Theta_j} \ell(\theta)d\Pi(\theta) / \int_{\Theta} \ell(\theta)d\Pi(\theta)$, $j = 0, 1$, where $\Pi(\theta)$ is
a c.d.f. on $\Theta$, and reject $H_0$ when $\hat\pi_1 > \hat\pi_0$. If $\Pi$ is treated as a prior c.d.f.,
then $\hat\pi_j$ is the posterior probability of $\Theta_j$, and this test is a particular Bayes
action (see Exercise 18 in §4.6) and is called a Bayes test.

In Bayesian analysis, one often considers the Bayes factor defined to be
\[ \beta = \frac{\text{posterior odds ratio}}{\text{prior odds ratio}} = \frac{\hat\pi_0/\hat\pi_1}{\pi_0/\pi_1}, \]
where $\pi_j = \Pi(\Theta_j)$ is the prior probability of $\Theta_j$.
Clearly, if there is a statistic sufficient for $\theta$, then the Bayes test and
Bayes factor depend only on the sufficient statistic.

Consider the special case where $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$ are simple
hypotheses. For given $X = x$,
\[ \hat\pi_j = \frac{\pi_j f_{\theta_j}(x)}{\pi_0 f_{\theta_0}(x) + \pi_1 f_{\theta_1}(x)}. \]
Rejecting $H_0$ when $\hat\pi_1 > \hat\pi_0$ is the same as rejecting $H_0$ when
\[ \frac{f_{\theta_1}(x)}{f_{\theta_0}(x)} > \frac{\pi_0}{\pi_1}. \tag{6.77} \]
This is equivalent to the UMP test $T_*$ in (6.3) (Theorem 6.1) with $c = \pi_0/\pi_1$
and $\gamma = 0$. The Bayes factor in this case is
\[ \beta = \frac{\hat\pi_0\pi_1}{\hat\pi_1\pi_0} = \frac{f_{\theta_0}(x)}{f_{\theta_1}(x)}. \]
Thus, the UMP test $T_*$ in (6.3) is equivalent to the test that rejects $H_0$
when the Bayes factor is small. Note that the rejection region given by
(6.77) depends on prior probabilities, whereas the Bayes factor does not.

When either $\Theta_0$ or $\Theta_1$ is not simple, however, Bayes factors also depend
on the prior $\Pi$.




If $\Pi$ is an improper prior, the Bayes test is still defined as long as the
posterior probabilities $\hat\pi_j$ are finite. However, the Bayes factor may not be
well defined when $\Pi$ is improper.


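Example 6.25. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with an unknown
$\mu \in \mathcal{R}$ and a known $\sigma^2 > 0$. Let the prior of $\mu$ be $N(\xi, \tau^2)$. Then the
posterior of $\mu$ is $N(\mu_*(x), c^2)$, where
\[ \mu_*(x) = \frac{\sigma^2}{n\tau^2 + \sigma^2}\,\xi + \frac{n\tau^2}{n\tau^2 + \sigma^2}\,\bar{x} \quad \text{and} \quad c^2 = \frac{\tau^2\sigma^2}{n\tau^2 + \sigma^2} \]
(see Example 2.25). Consider first the problem of testing $H_0 : \mu \le \mu_0$
versus $H_1 : \mu > \mu_0$. Let $\Phi$ be the c.d.f. of the standard normal. Then the
posterior probability of $\Theta_0$ and the Bayes factor are, respectively,
\[ \hat\pi_0 = \Phi\left(\frac{\mu_0 - \mu_*(x)}{c}\right) \quad \text{and} \quad \beta = \frac{\Phi\left(\frac{\mu_0 - \mu_*(x)}{c}\right)\Phi\left(\frac{\xi - \mu_0}{\tau}\right)}{\Phi\left(\frac{\mu_*(x) - \mu_0}{c}\right)\Phi\left(\frac{\mu_0 - \xi}{\tau}\right)}. \]
It is interesting to see that if we let $\tau \to \infty$, which is the same as considering
the improper prior $\Pi$ = the Lebesgue measure on $\mathcal{R}$, then
\[ \hat\pi_0 \to \Phi\left(\frac{\mu_0 - \bar{x}}{\sigma/\sqrt{n}}\right), \]
which is exactly the p-value $\hat\alpha(x)$ derived in Example 2.29.

Consider next the problem of testing $H_0 : \mu = \mu_0$ versus $H_1 : \mu \neq \mu_0$.
In this case the prior c.d.f. cannot be continuous at $\mu_0$. We consider $\Pi(\mu) =
\pi_0 I_{[\mu_0,\infty)}(\mu) + (1 - \pi_0)\Phi\left(\frac{\mu - \xi}{\tau}\right)$. Let $\ell(\mu)$ be the likelihood function based
on $\bar{x}$. Then
\[ m_1(x) = \int_{\mu \neq \mu_0} \ell(\mu)\,d\Phi\left(\frac{\mu - \xi}{\tau}\right) = \frac{\sqrt{n}}{\sqrt{n\tau^2 + \sigma^2}}\,\Phi'\left(\frac{\sqrt{n}(\bar{x} - \xi)}{\sqrt{n\tau^2 + \sigma^2}}\right), \]
where $\Phi'(t)$ is the p.d.f. of the standard normal distribution, and
\[ \hat\pi_0 = \frac{\pi_0\ell(\mu_0)}{\pi_0\ell(\mu_0) + (1 - \pi_0)m_1(x)} = \left(1 + \frac{1 - \pi_0}{\pi_0}\,\frac{1}{\beta}\right)^{-1}, \]
where
\[ \beta = \frac{\ell(\mu_0)}{m_1(x)} \]
is the Bayes factor. ∎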
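
The quantities in this example can be evaluated with the standard library alone. The Python sketch below computes the posterior probability of the null hypothesis and the Bayes factor for both the one-sided problem and the point null; all function names are hypothetical. For the point null the code takes the likelihood to be the $N(\mu, \sigma^2/n)$ density of $\bar{x}$ (an assumption of this sketch), so that $m_1(x)$ is the $N(\xi, \tau^2 + \sigma^2/n)$ density at $\bar{x}$.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def posterior(xbar, n, sigma2, xi, tau2):
    """Posterior mean mu_*(x) and standard deviation c of mu."""
    weight = n * tau2 / (n * tau2 + sigma2)
    mu_star = (1 - weight) * xi + weight * xbar
    c = math.sqrt(tau2 * sigma2 / (n * tau2 + sigma2))
    return mu_star, c

def one_sided(xbar, n, sigma2, xi, tau2, mu0):
    """Posterior probability of H0: mu <= mu0 and the Bayes factor."""
    mu_star, c = posterior(xbar, n, sigma2, xi, tau2)
    post0 = PHI.cdf((mu0 - mu_star) / c)
    prior0 = PHI.cdf((mu0 - xi) / math.sqrt(tau2))
    bayes_factor = (post0 / (1 - post0)) / (prior0 / (1 - prior0))
    return post0, bayes_factor

def point_null(xbar, n, sigma2, xi, tau2, mu0, pi0):
    """Posterior probability of H0: mu = mu0 (prior mass pi0) and the Bayes factor."""
    like0 = NormalDist(mu0, math.sqrt(sigma2 / n)).pdf(xbar)        # l(mu0)
    m1 = NormalDist(xi, math.sqrt(tau2 + sigma2 / n)).pdf(xbar)     # m_1(x)
    bayes_factor = like0 / m1
    post0 = 1 / (1 + (1 - pi0) / pi0 / bayes_factor)
    return post0, bayes_factor
```

Calling `one_sided` with a very large `tau2` gives a posterior probability close to $\Phi(\sqrt{n}(\mu_0 - \bar{x})/\sigma)$, the limit noted above for the improper prior.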


More discussions about Bayesian hypothesis tests can be found in Berger 
(1985, §4.3.3). 




6.5 Tests in Nonparametric Models 


In a nonparametric problem, a UMP, UMPU, or UMPI test usually does not
exist. In this section we study some nonparametric tests that have size $\alpha$,
limiting size $\alpha$, or asymptotic significance level $\alpha$. Consistency (Definition
2.13) of these nonparametric tests is also discussed.


Nonparametric tests are derived using some intuitively appealing ideas. 
They are commonly referred to as distribution-free tests, since almost no 
assumption is imposed on the population under consideration. But a non- 
parametric test may not be as good as a parametric test (in terms of its 
power) when the parametric model is correct. This is very similar to the 
case where we consider parametric estimation methods versus nonparamet- 
ric estimation methods. 


6.5.1 Sign, permutation, and rank tests 


Three popular classes of nonparametric tests are introduced here. The first
one is the class of sign tests. Let $X_1,...,X_n$ be i.i.d. random variables from
$F$, $u$ be a fixed constant, and $p = F(u)$. Consider the problem of testing
$H_0 : p \le p_0$ versus $H_1 : p > p_0$, or testing $H_0 : p = p_0$ versus $H_1 : p \neq p_0$,
where $p_0$ is a fixed constant in $(0,1)$. Let
\[ \Delta_i = \begin{cases} 1 & X_i - u \le 0 \\ 0 & X_i - u > 0, \end{cases} \qquad i = 1,...,n. \]
Then $\Delta_1,...,\Delta_n$ are i.i.d. binary random variables with $p = P(\Delta_i = 1)$.
For testing $H_0 : p \le p_0$ versus $H_1 : p > p_0$, it follows from Corollary 6.1
that the test
\[ T_*(Y) = \begin{cases} 1 & Y > m \\ \gamma & Y = m \\ 0 & Y < m \end{cases} \tag{6.78} \]
is of size $\alpha$ and UMP among tests based on $\Delta_i$'s, where $Y = \sum_{i=1}^{n} \Delta_i$
and $m$ and $\gamma$ satisfy (6.7). Although $T_*$ is of size $\alpha$, we cannot conclude
immediately that $T_*$ is a UMP test, since $\Delta_1,...,\Delta_n$ may not be sufficient
for $F$. However, it can be shown that $T_*$ is in fact a UMP test (Lehmann,
1986, pp. 106-107) in this particular case. Note that no assumption is
imposed on $F$.

For testing $H_0 : p = p_0$ versus $H_1 : p \neq p_0$, it follows from Theorem 6.4
that the test
\[ T_*(Y) = \begin{cases} 1 & Y < c_1 \text{ or } Y > c_2 \\ \gamma_i & Y = c_i, \ i = 1, 2 \\ 0 & c_1 < Y < c_2 \end{cases} \tag{6.79} \]
is of size $\alpha$ and UMP among unbiased tests based on $\Delta_i$'s, where $\gamma_i$'s and $c_i$'s
are chosen so that $E(T_*) = \alpha$ and $E(T_*Y) = \alpha np_0$ when $p = p_0$. This test
is in fact a UMPU test (Lehmann, 1986, p. 166).

Since $Y$ is equal to the number of nonnegative signs of $(u - X_i)$'s, tests
based on $T_*$ in (6.78) or (6.79) are called sign tests. One can easily extend
the sign tests to the case where $p = P(X_1 \in B)$ with any fixed event $B$.
Another extension is to the case where we observe i.i.d. $(X_1, Y_1),...,(X_n, Y_n)$
(matched pairs). By using $\Delta_i = X_i - Y_i - u$, one can obtain sign tests for
hypotheses concerning $P(X_1 - Y_1 \le u)$.
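
For the one-sided sign test (6.78), the constants $m$ and $\gamma$ are determined by the binomial distribution of $Y$ under $p = p_0$ through $P(Y > m) + \gamma P(Y = m) = \alpha$. The Python sketch below finds them by accumulating the upper tail; the function names and the data are hypothetical.

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for Y binomial(n, p)."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

def sign_test_constants(n, p0, alpha):
    """Find (m, gamma) with P(Y > m) + gamma * P(Y = m) = alpha under p = p0."""
    tail = 0.0  # running value of P(Y > m)
    for m in range(n, -1, -1):
        pm = binom_pmf(m, n, p0)
        if tail + pm > alpha:
            return m, (alpha - tail) / pm
        tail += pm
    raise ValueError("alpha must be smaller than 1")

def sign_test(x, u, p0, alpha):
    """Value of T_*(Y) in (6.78): the probability of rejecting H0: p <= p0."""
    y = sum(1 for value in x if value <= u)  # Y = number of X_i with X_i - u <= 0
    m, gamma = sign_test_constants(len(x), p0, alpha)
    return 1.0 if y > m else (gamma if y == m else 0.0)

m, gamma = sign_test_constants(n=10, p0=0.5, alpha=0.05)
decision = sign_test([-1.0] * 9 + [1.0], u=0.0, p0=0.5, alpha=0.05)
```

Because $Y$ is discrete, the randomization constant $\gamma$ is what makes the size exactly $\alpha$; setting $\gamma = 0$ gives a nonrandomized test of level $\alpha$ whose size is smaller than $\alpha$.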

Next, we introduce the class of permutation tests. Let $X_{i1},...,X_{in_i}$,
$i = 1, 2$, be two independent samples i.i.d. from $F_i$, $i = 1, 2$, respectively,
where $F_i$'s are c.d.f.'s on $\mathcal{R}$. In §6.2.3, we showed that the two-sample
t-tests are UMPU tests for testing hypotheses concerning the means of
$F_i$'s, under the assumption that $F_i$'s are normal with the same variance.
Such types of testing problems arise from the comparison of two treatments.
Suppose now we remove the normality assumption and replace it by a much
weaker assumption that $F_i$'s are in the nonparametric family $\mathcal{F}$ containing
all continuous c.d.f.'s on $\mathcal{R}$. Consider the problem of testing
\[ H_0 : F_1 = F_2 \quad \text{versus} \quad H_1 : F_1 \neq F_2, \tag{6.80} \]
which is the same as testing the equality of the means of $F_i$'s when $F_i$'s are
normal with the same variance.

Let $X = (X_{ij}, j = 1,...,n_i, i = 1, 2)$, $n = n_1 + n_2$, and $\alpha$ be a given
significance level. A test $T(X)$ satisfying
\[ \frac{1}{n!}\sum_{z \in \pi(x)} T(z) = \alpha \tag{6.81} \]
is called a permutation test, where $\pi(x)$ is the set of $n!$ points obtained
from $x \in \mathcal{R}^n$ by permuting the components of $x$. Permutation tests are
of size $\alpha$ (exercise). Under the assumption that $F_1(x) = F_2(x - \theta)$ and
$F_1 \in \mathcal{F}$ containing all c.d.f.'s having Lebesgue p.d.f.'s that are continuous
a.e., which is still much weaker than the assumption that $F_i$'s are normal
with the same variance, the class of permutation tests of size $\alpha$ is exactly
the same as the class of unbiased tests of size $\alpha$; see, for example, Lehmann
(1986, p. 231).

Unfortunately, a test UMP among all permutation tests of size $\alpha$ does
not exist. In applications, we usually choose a Lebesgue p.d.f. $h$ and define
a permutation test
\[ T(X) = \begin{cases} 1 & h(X) > h_m \\ \gamma & h(X) = h_m \\ 0 & h(X) < h_m, \end{cases} \tag{6.82} \]
where $h_m$ is the $(m+1)$th largest value of the set $\{h(z) : z \in \pi(x)\}$, $m$ is
the integer part of $\alpha n!$, and $\gamma = \alpha n! - m$. This permutation test is optimal
in some sense (Lehmann, 1986, §5.11).
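
For very small samples a test of the form (6.82) can be carried out by enumerating all $n!$ permutations, as in the Python sketch below. Two choices in it are assumptions of this illustration and not part of the text: the ordering function is the difference of the two sample means rather than a Lebesgue p.d.f. $h$, and the randomization constant is computed as $(\alpha n! - \#\{h > h_m\})/\#\{h = h_m\}$, which reduces to $\alpha n! - m$ when the values $h(z)$ are all distinct and keeps (6.81) valid when some of them coincide.

```python
from itertools import permutations
from math import fsum

def mean_difference(z, n1):
    """Illustrative ordering statistic: mean of first n1 entries minus mean of the rest."""
    return fsum(z[:n1]) / n1 - fsum(z[n1:]) / (len(z) - n1)

def permutation_test(x1, x2, alpha, stat=mean_difference):
    """Return (T(x), gamma, size), where size is the average of T over pi(x)."""
    n1 = len(x1)
    pooled = tuple(x1) + tuple(x2)
    values = sorted((stat(z, n1) for z in permutations(pooled)), reverse=True)
    total = len(values)                      # n! points in pi(x)
    m = int(alpha * total + 1e-9)            # integer part of alpha * n!
    h_m = values[m]                          # (m+1)th largest value
    above = sum(v > h_m for v in values)
    ties = sum(v == h_m for v in values)
    gamma = (alpha * total - above) / ties   # tie-aware randomization constant

    def t(z):
        h = stat(z, n1)
        return 1.0 if h > h_m else (gamma if h == h_m else 0.0)

    size = fsum(t(z) for z in permutations(pooled)) / total
    return t(pooled), gamma, size

decision, gamma, size = permutation_test([5, 6], [1, 2, 3], alpha=0.1)
```

The enumeration grows as $n!$, so this direct approach is practical only for small $n$; the returned `size` confirms that condition (6.81) holds for the observed $x$.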


While the class of permutation tests is motivated by the unbiasedness 
principle, the third class of tests introduced here is motivated by the in- 
variance principle. 


Consider first the one-sample problem in which $X_1,...,X_n$ are i.i.d. random
variables from a continuous c.d.f. $F$ and we would like to test
\[ H_0 : F \text{ is symmetric about } 0 \quad \text{versus} \quad H_1 : F \text{ is not symmetric about } 0. \]
Let $\mathcal{G}$ be the class of transformations $g(x) = (\psi(x_1),...,\psi(x_n))$, where $\psi$ is
continuous, odd, and strictly increasing. Let $R(X)$ be the vector of ranks of
$|X_i|$'s and $R_+(X)$ (or $R_-(X)$) be the subvector of $R(X)$ containing ranks
corresponding to positive (or negative) $X_i$'s. It can be shown (exercise) that
$(R_+, R_-)$ is maximal invariant under $\mathcal{G}$. Furthermore, sufficiency permits
a reduction from $R_+$ and $R_-$ to $R_+^o$, the vector of ordered components of
$R_+$. A test based on $R_+^o$ is called a (one-sample) signed rank test.

Similar to the case of permutation tests, there is no UMP test within
the class of signed rank tests. A common choice is the signed rank test that
rejects $H_0$ when $W(R_+^o)$ is too large or too small, where
\[ W(R_+^o) = J(R_{+1}^o/n) + \cdots + J(R_{+n_*}^o/n), \tag{6.83} \]
$J$ is a continuous and strictly increasing function on $[0,1]$, $R_{+i}^o$ is the $i$th
component of $R_+^o$, and $n_*$ is the number of positive $X_i$'s. This is motivated
by the fact that $H_0$ is unlikely to be true if $W$ in (6.83) is too large or too
small. Note that $W/n$ is equal to $T(F_n)$ with $T$ given by (5.53) and, when $J(t) = t$,
the test based on $W$ in (6.83) is the well-known one-sample Wilcoxon
signed rank test.

Under $H_0$, $P(R_+^o = y) = 2^{-n}$ for each $y \in \mathcal{Y}$ containing $2^n$ $n_*$-tuples
$y = (y_1,...,y_{n_*})$ satisfying $1 \le y_1 < \cdots < y_{n_*} \le n$. Then, the following
signed rank test is of size $\alpha$:
\[ T(X) = \begin{cases} 1 & W(R_+^o) < c_1 \text{ or } W(R_+^o) > c_2 \\ \gamma & W(R_+^o) = c_i, \ i = 1, 2 \\ 0 & c_1 < W(R_+^o) < c_2, \end{cases} \tag{6.84} \]
where $c_1$ and $c_2$ are the $(m+1)$th smallest and largest values of the set
$\{W(y) : y \in \mathcal{Y}\}$, $m$ is the integer part of $\alpha 2^n/2$, and $\gamma = \alpha 2^n/2 - m$.
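
For small $n$ the null distribution used in (6.84) can be enumerated directly. The Python sketch below does this for the Wilcoxon case $J(t) = t$, working with the integer rank sum $S = nW$ so that all comparisons are exact. As an assumption of this sketch, the randomization constant is computed as $(\alpha 2^n/2 - \#\{S < c_1\})/\#\{S = c_1\}$, which agrees with $\alpha 2^n/2 - m$ when no other value ties with $c_1$ and keeps the size equal to $\alpha$ when several subsets give the same rank sum. The function names and data are hypothetical.

```python
from itertools import combinations

def signed_rank_null(n):
    """Sorted list of the 2^n equally likely values of S (sum of ranks of positive X_i)."""
    values = []
    for size in range(n + 1):
        for y in combinations(range(1, n + 1), size):
            values.append(sum(y))
    return sorted(values)

def signed_rank_constants(n, alpha):
    """Return (c1, c2, gamma) for the two-sided test on the scale of S = n * W."""
    values = signed_rank_null(n)
    total = len(values)
    m = int(alpha * total / 2 + 1e-9)        # integer part of alpha * 2^n / 2
    c1, c2 = values[m], values[-(m + 1)]     # (m+1)th smallest and largest values
    below = sum(v < c1 for v in values)
    ties = sum(v == c1 for v in values)
    gamma = (alpha * total / 2 - below) / ties
    return c1, c2, gamma

def signed_rank_statistic(x):
    """S for a sample without zeros or ties in |x_i| (continuous F)."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if x[i] > 0)

c1, c2, gamma = signed_rank_constants(n=5, alpha=0.1)
s = signed_rank_statistic([1.5, -0.5, 2.0, -3.0, 0.7])
```

The null distribution of $S$ is symmetric about $n(n+1)/4$, which is why a single $\gamma$ serves both tails.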
Consider next the two-sample problem of testing (6.80) based on two
independent samples, $X_{i1},...,X_{in_i}$, $i = 1, 2$, i.i.d. from $F_i$, $i = 1, 2$, respectively.
Let $\mathcal{G}$ be the class of transformations $g(x) = (\psi(x_{ij}), j = 1,...,n_i, i =
1, 2)$, where $\psi$ is continuous and strictly increasing. Let $R(X)$ be the vector
of ranks of all $X_{ij}$'s. In Example 6.14, we showed that $R$ is maximal
invariant under $\mathcal{G}$. Again, sufficiency permits a reduction from $R$ to $R_1^o$,
the vector of ordered values of the ranks of $X_{11},...,X_{1n_1}$. A test for (6.80)
based on $R_1^o$ is called a two-sample rank test. Under $H_0$, $P(R_1^o = y) =
\binom{n}{n_1}^{-1}$ for each $y \in \mathcal{Y}$ containing $\binom{n}{n_1}$ $n_1$-tuples $y = (y_1,...,y_{n_1})$ satisfying
$1 \le y_1 < \cdots < y_{n_1} \le n$. Let $R_1^o = (R_{11}^o,...,R_{1n_1}^o)$. Then a commonly
used two-sample rank test is given by (6.83)-(6.84) with $R_{+i}^o$, $n_*$, and $2^n$
replaced by $R_{1i}^o$, $n_1$, and $\binom{n}{n_1}$, respectively. When $n_1 = n_2$, the statistic
$W/n$ is equal to $T(F_n)$ with $T$ given by (5.55). When $J(t) = t - \frac{1}{2}$, this
reduces to the well-known two-sample Wilcoxon rank test.

A common feature of the permutation and rank tests previously introduced is that tests of size $\alpha$ can be obtained for each fixed sample size $n$, but the computation involved in determining the rejection regions $\{T(X) = 1\}$ may be cumbersome if $n$ is large. Thus, one may consider approximations to permutation and rank tests when $n$ is large. Permutation tests can often be approximated by the two-sample t-tests derived in §6.2.3 (Lehmann, 1986, §5.13). Using the results in §5.2.2, we now derive one-sample signed rank tests having limiting size $\alpha$ (Definition 2.13(ii)), which can be viewed as signed rank tests of size approximately $\alpha$ when $n$ is large.


From the discussion in §5.2.2, $W/n = T(F_n)$ with a $\varrho_\infty$-Hadamard differentiable functional $T$ given by (5.53) and, by Theorem 5.5,

$$\sqrt{n}[W/n - T(F)] \to_d N(0, \sigma_F^2),$$

where $\sigma_F^2 = E[\phi_F(X_1)]^2$,

$$\phi_F(x) = \int_0^\infty J'(\tilde F(y))(\tilde\delta_x - \tilde F)(y)\,dF(y) + J(\tilde F(x)) - T(F)$$

(see (5.54)), and $\delta_x$ denotes the c.d.f. degenerated at $x$. Since $F$ is continuous, $\tilde F(x) = F(x) - F(-x)$. Under $H_0$, $F(x) = 1 - F(-x)$. Hence, $\sigma_F^2$ under $H_0$ is equal to $v_1 + v_2 + 2v_{12}$, where


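$$v_1 = \mathrm{Var}(J(\tilde F(X_1))) = \frac{1}{2}\int_0^\infty [J(\tilde F(x))]^2 d\tilde F(x) - [T(F)]^2$$

(the term $J(\tilde F(x))$ contributes only for $x > 0$),

$$\begin{aligned}
v_2 &= \mathrm{Var}\left(\int_0^\infty J'(\tilde F(y))(\tilde\delta_{X_1} - \tilde F)(y)\,dF(y)\right) \\
&= \frac{1}{4}\int_0^\infty\!\!\int_0^\infty J'(\tilde F(y))J'(\tilde F(z))[\tilde F(\min\{y,z\}) - \tilde F(y)\tilde F(z)]\,d\tilde F(y)\,d\tilde F(z) \\
&= \frac{1}{2}\iint_{0<z<y<\infty} J'(\tilde F(y))J'(\tilde F(z))\tilde F(z)[1 - \tilde F(y)]\,d\tilde F(y)\,d\tilde F(z),
\end{aligned}$$

and

$$\begin{aligned}
v_{12} &= \mathrm{Cov}\left(J(\tilde F(X_1)), \int_0^\infty J'(\tilde F(y))(\tilde\delta_{X_1} - \tilde F)(y)\,dF(y)\right) \\
&= E\left[J(\tilde F(X_1))\int_0^\infty J'(\tilde F(y))(\tilde\delta_{X_1} - \tilde F)(y)\,dF(y)\right] \\
&= \frac{1}{4}\int_0^\infty\!\!\int_0^\infty J(\tilde F(x))J'(\tilde F(y))[\delta_x(y) - \tilde F(y)]\,d\tilde F(y)\,d\tilde F(x).
\end{aligned}$$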


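Note that under $H_0$, the distribution of $W$ is completely known. Indeed, letting $s = \tilde F(y)$ and $t = \tilde F(z)$, we conclude that $\sigma_F^2 = v_1 + v_2 + 2v_{12}$ and

$$T(F) = \int_0^\infty J(\tilde F(x))\,dF(x) = \frac{1}{2}\int_0^1 J(s)\,ds$$

do not depend on $F$. Hence, a signed rank test $T$ that rejects $H_0$ when

$$\sqrt{n}|W/n - t_0| > \sigma_0 z_{1-\alpha/2}, \tag{6.85}$$

where $z_a = \Phi^{-1}(a)$ and $t_0 = T(F)$ and $\sigma_0^2 = \sigma_F^2$ under $H_0$ are known constants, has the property that

$$\begin{aligned}
\sup_{P \in \mathcal{P}_0} \beta_T(P) &= \sup_{P \in \mathcal{P}_0} P\left(\sqrt{n}|W/n - t_0| > \sigma_0 z_{1-\alpha/2}\right) \\
&= P_W\left(\sqrt{n}|W/n - t_0| > \sigma_0 z_{1-\alpha/2}\right) \\
&\to \alpha,
\end{aligned}$$

i.e., $T$ has limiting size $\alpha$.

Two-sample rank tests having limiting size $\alpha$ can be similarly derived (exercise).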
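As a rough illustration of (6.85), the sketch below treats the Wilcoxon case $J(t) = t$, for which $t_0 = \frac{1}{2}\int_0^1 s\,ds = \frac{1}{4}$ and $\sigma_0^2 = \frac{1}{12}$ (the value $V_n = (12n)^{-1}$ quoted in Example 6.27). The function name is hypothetical, and the data are assumed to contain no zeros and no tied absolute values.

```python
import math
from statistics import NormalDist

def asymptotic_signed_rank_test(x, alpha=0.05):
    """Large-sample Wilcoxon signed rank test based on (6.85) with J(t) = t.

    Returns (statistic, critical value, reject?).
    """
    n = len(x)
    by_abs = sorted(range(n), key=lambda i: abs(x[i]))
    # W in (6.83): sum of rank/n over the positive observations
    w = sum(r / n for r, i in enumerate(by_abs, start=1) if x[i] > 0)
    t0 = 0.25                      # (1/2) * integral of s over [0, 1]
    sigma0 = math.sqrt(1 / 12)     # null standard deviation for J(t) = t
    stat = math.sqrt(n) * abs(w / n - t0)
    crit = sigma0 * NormalDist().inv_cdf(1 - alpha / 2)
    return stat, crit, stat > crit

# A sample with alternating signs is consistent with symmetry about 0.
balanced = [(-1) ** (i + 1) * i for i in range(1, 21)]
print(asymptotic_signed_rank_test(balanced))
```

For small $n$ the exact test (6.84) is preferable, since the normal approximation only guarantees limiting size $\alpha$.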


6.5.2 Kolmogorov-Smirnov and Cramér-von Mises tests 


In this section we introduce two types of tests for hypotheses concerning continuous c.d.f.'s on $\mathcal{R}$. Let $X_1,...,X_n$ be i.i.d. random variables from a continuous c.d.f. $F$. Suppose that we would like to test hypotheses (6.73), i.e., $H_0: F = F_0$ versus $H_1: F \ne F_0$ with a fixed $F_0$. Let $F_n$ be the empirical c.d.f. and

$$D_n(F) = \sup_{x \in \mathcal{R}} |F_n(x) - F(x)|, \tag{6.86}$$

which is in fact the distance $\varrho_\infty(F_n, F)$. Intuitively, $D_n(F_0)$ should be small if $H_0$ is true. From the results in §5.1.1, we know that $D_n(F_0) \to_{a.s.} 0$ if and only if $H_0$ is true. The statistic $D_n(F_0)$ is called the Kolmogorov-Smirnov statistic. Tests with rejection region $D_n(F_0) > c$ are called Kolmogorov-Smirnov tests.


In some cases we would like to test "one-sided" hypotheses $H_0: F = F_0$ versus $H_1: F \ge F_0$, $F \ne F_0$, or $H_0: F = F_0$ versus $H_1: F \le F_0$, $F \ne F_0$. The corresponding Kolmogorov-Smirnov statistic is $D_n^+(F_0)$ or $D_n^-(F_0)$, where

$$D_n^+(F) = \sup_{x \in \mathcal{R}} [F_n(x) - F(x)] \tag{6.87}$$

and

$$D_n^-(F) = \sup_{x \in \mathcal{R}} [F(x) - F_n(x)].$$

The rejection regions of one-sided Kolmogorov-Smirnov tests are, respectively, $D_n^+(F_0) > c$ and $D_n^-(F_0) > c$.

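Let $X_{(1)} < \cdots < X_{(n)}$ be the order statistics and define $X_{(0)} = -\infty$ and $X_{(n+1)} = \infty$. Since $F_n(x) = i/n$ when $X_{(i)} \le x < X_{(i+1)}$, $i = 0,1,...,n$,

$$\begin{aligned}
D_n^+(F) &= \max_{0 \le i \le n}\ \sup_{X_{(i)} \le x < X_{(i+1)}} \left[\frac{i}{n} - F(x)\right] \\
&= \max_{0 \le i \le n} \left[\frac{i}{n} - \inf_{X_{(i)} \le x < X_{(i+1)}} F(x)\right] \\
&= \max_{0 \le i \le n} \left[\frac{i}{n} - F(X_{(i)})\right].
\end{aligned}$$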


When $F$ is continuous, $F(X_{(i)})$ is the $i$th order statistic of a sample of size $n$ from the uniform distribution $U(0,1)$ irrespective of what $F$ is. Therefore, the distribution of $D_n^+(F)$ does not depend on $F$, if we restrict our attention to continuous c.d.f.'s on $\mathcal{R}$. The distribution of $D_n^-(F)$ is the same as that of $D_n^+(F)$ because of symmetry (exercise). Since

$$D_n(F) = \max\{D_n^+(F), D_n^-(F)\},$$

the distribution of $D_n(F)$ does not depend on $F$. This means that the distributions of Kolmogorov-Smirnov statistics are known under $H_0$.
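The order-statistic representation above gives a direct way to compute the three statistics. The sketch below is illustrative only: the function name is hypothetical, `cdf` stands for the hypothesized $F_0$, and the formula used for $D_n^-$, namely $\max_{1 \le i \le n}[F(X_{(i)}) - (i-1)/n]$, is the analogue of the display for $D_n^+$ obtained by the same argument rather than a formula stated in the text.

```python
def ks_statistics(sample, cdf):
    """Return (D_n^+, D_n^-, D_n) for a hypothesized continuous c.d.f."""
    u = sorted(cdf(x) for x in sample)     # F(X_(1)) <= ... <= F(X_(n))
    n = len(u)
    # The i = 0 term of the maximum equals 0, hence the outer max with 0.0.
    d_plus = max(0.0, max(i / n - ui for i, ui in enumerate(u, start=1)))
    d_minus = max(0.0, max(ui - (i - 1) / n for i, ui in enumerate(u, start=1)))
    return d_plus, d_minus, max(d_plus, d_minus)

# Testing H0: F is the U(0,1) c.d.f. with three observations.
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
print(ks_statistics([0.1, 0.4, 0.7], uniform_cdf))
```

Because only the values $F_0(X_{(i)})$ enter, the same code applies to any continuous $F_0$, which mirrors the distribution-free property noted above.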


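Theorem 6.10. Let $D_n(F)$ and $D_n^+(F)$ be defined by (6.86) and (6.87), respectively, for a continuous c.d.f. $F$ on $\mathcal{R}$.
(i) For any fixed $n$,

$$P(D_n^+(F) \le t) = \begin{cases} 0 & t \le 0 \\ n!\displaystyle\prod_{i=1}^n \int_{\max\{0, \frac{n-i+1}{n} - t\}}^{u_{n-i+2}} du_1 \cdots du_n & 0 < t < 1 \\ 1 & t \ge 1 \end{cases}$$

and

$$P(D_n(F) \le t) = \begin{cases} 0 & t \le \frac{1}{2n} \\ n!\displaystyle\prod_{i=1}^n \int_{\max\{0, \frac{n-i+1}{n} - t\}}^{\min\{u_{n-i+2}, \frac{n-i}{n} + t\}} du_1 \cdots du_n & \frac{1}{2n} < t < 1 \\ 1 & t \ge 1, \end{cases}$$

where $u_{n+1} = 1$.
(ii) For $t > 0$,

$$\lim_{n \to \infty} P(\sqrt{n} D_n^+(F) \le t) = 1 - e^{-2t^2}$$

and

$$\lim_{n \to \infty} P(\sqrt{n} D_n(F) \le t) = 1 - 2\sum_{j=1}^\infty (-1)^{j-1} e^{-2j^2t^2}.$$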

The proof of Theorem 6.10(i) is left as an exercise. The proof of Theorem 
6.10(ii) can be found in Kolmogorov (1933) and Smirnov (1944). 

When $n$ is not large, Kolmogorov-Smirnov tests of size $\alpha$ can be obtained using the results in Theorem 6.10(i). When $n$ is large, using the results in Theorem 6.10(i) is not convenient. We can obtain Kolmogorov-Smirnov tests of limiting size $\alpha$ using the results in Theorem 6.10(ii).
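Critical values for tests of limiting size $\alpha$ follow by inverting the limits in Theorem 6.10(ii). The sketch below is an illustration under stated assumptions: the function names are hypothetical, the infinite series is truncated at a fixed number of terms, and the two-sided critical value is found by bisection on an assumed bracket $[0.2, 5]$.

```python
import math

def kolmogorov_limit_cdf(t, terms=100):
    """Limit of P(sqrt(n) D_n <= t) in Theorem 6.10(ii), series truncated."""
    if t <= 0:
        return 0.0
    s = sum((-1) ** (j - 1) * math.exp(-2 * j * j * t * t)
            for j in range(1, terms + 1))
    return 1 - 2 * s

def one_sided_critical_value(alpha):
    """Solve 1 - exp(-2 t^2) = 1 - alpha for t (test based on D_n^+ or D_n^-)."""
    return math.sqrt(-math.log(alpha) / 2)

def two_sided_critical_value(alpha, lo=0.2, hi=5.0):
    """Bisection for t with kolmogorov_limit_cdf(t) = 1 - alpha."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if kolmogorov_limit_cdf(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Reject H0 when sqrt(n) * D_n(F0) exceeds the two-sided value.
print(two_sided_critical_value(0.05), one_sided_critical_value(0.05))
```

With $\alpha = 0.05$ the two-sided value is about 1.358, so the test of limiting size 0.05 rejects $H_0$ when $\sqrt{n} D_n(F_0) > 1.358$.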

Another test for $H_0: F = F_0$ versus $H_1: F \ne F_0$ is the Cramér-von Mises test, which rejects $H_0$ when $C_n(F_0) > c$, where

$$C_n(F) = \int [F_n(x) - F(x)]^2 dF(x) \tag{6.88}$$

is another measure of disparity between $F_n$ and $F$. Similar to $D_n(F)$, the distribution of $C_n(F)$ does not depend on $F$ (exercise). Hence, a Cramér-von Mises test of size $\alpha$ can be obtained. When $n$ is large, it is more convenient to use a Cramér-von Mises test of limiting size $\alpha$. Note that $C_n(F_0)$ is actually a V-statistic (§3.5.3) with kernel


$$h(x_1, x_2) = \int [\delta_{x_1}(y) - F_0(y)][\delta_{x_2}(y) - F_0(y)]\,dF_0(y)$$

and

$$h_1(x_1) = E[h(x_1, X_2)] = \int [\delta_{x_1}(y) - F_0(y)][F(y) - F_0(y)]\,dF_0(y),$$

where $\delta_x$ denotes the c.d.f. degenerated at $x$. It follows from Theorem 3.16 that if $H_1$ is true, $C_n(F_0)$ is asymptotically normal, whereas if $H_0$ is true, $h_1(x_1) = 0$ and

$$nC_n(F_0) \to_d \sum_{j=1}^\infty \lambda_j \chi_{1j}^2,$$

where $\chi_{1j}^2$'s are i.i.d. from the chi-square distribution $\chi_1^2$ and $\lambda_j$'s are constants. In this case, Durbin (1973) showed that $\lambda_j = j^{-2}\pi^{-2}$.
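For computation, the integral in (6.88) can be evaluated in closed form. The sketch below uses the standard computing formula $C_n(F) = \frac{1}{12n^2} + \frac{1}{n}\sum_{i=1}^n \left[F(X_{(i)}) - \frac{2i-1}{2n}\right]^2$, which is not stated in the text and is brought in here as an assumption; the function name is hypothetical and `cdf` stands for the hypothesized continuous $F_0$.

```python
def cramer_von_mises(sample, cdf):
    """C_n(F) in (6.88), via the standard computing formula for continuous F."""
    u = sorted(cdf(x) for x in sample)     # F(X_(1)) <= ... <= F(X_(n))
    n = len(u)
    return 1 / (12 * n * n) + sum(
        (ui - (2 * i - 1) / (2 * n)) ** 2 for i, ui in enumerate(u, start=1)
    ) / n

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
print(cramer_von_mises([0.1, 0.4, 0.7], uniform_cdf))
```

For the three points above, integrating $[F_n(u) - u]^2$ piece by piece over $(0,1)$ gives $0.02$, which agrees with the formula. The test of limiting size $\alpha$ compares $nC_n(F_0)$ with a quantile of $\sum_j \lambda_j \chi_{1j}^2$.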


For testing (6.73), it is worthwhile to compare the goodness of fit test introduced in Example 6.23 with the Kolmogorov-Smirnov test (or Cramér-von Mises test). The former requires a partition of the range of observations and may lose information through partitioning, whereas the latter requires that $F$ be continuous and univariate; the latter is of size $\alpha$ (or limiting size $\alpha$), whereas the former is only of asymptotic significance level $\alpha$; and the former can be modified to allow estimation of unknown parameters under $H_0$ (i.e., hypotheses (6.74)), whereas the latter does not have this flexibility. Note that goodness of fit tests are nonparametric in nature, although $\chi^2$-tests are derived from a parametric model.


Kolmogorov-Smirnov tests can be extended to two-sample problems to test hypotheses in (6.80). Let $X_{i1},...,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from $F_i$ on $\mathcal{R}$, $i = 1,2$, and let $F_{in_i}$ be the empirical c.d.f. based on $X_{i1},...,X_{in_i}$. A Kolmogorov-Smirnov test rejects $H_0$ when $D_{n_1,n_2} > c$, where

$$D_{n_1,n_2} = \sup_{x \in \mathcal{R}} |F_{1n_1}(x) - F_{2n_2}(x)|.$$

A Kolmogorov-Smirnov test of limiting size $\alpha$ can be obtained using

$$\lim_{n_1,n_2 \to \infty} P\left(\sqrt{n_1n_2/(n_1+n_2)}\,D_{n_1,n_2} \le t\right) = \sum_{j=-\infty}^\infty (-1)^j e^{-2j^2t^2}, \quad t > 0.$$
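The two-sample statistic only needs the two empirical c.d.f.'s at the pooled observations, since both are step functions that jump only there. The following sketch is illustrative: the function names are hypothetical and the limiting series is truncated at a fixed number of terms.

```python
import math
from bisect import bisect_right

def two_sample_ks(x, y):
    """D_{n1,n2}: largest gap between the two empirical c.d.f.'s."""
    xs, ys = sorted(x), sorted(y)
    pooled = sorted(set(xs) | set(ys))
    return max(abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
               for t in pooled)

def asymptotic_p_value(d, n1, n2, terms=100):
    """1 minus the limiting c.d.f. at sqrt(n1 n2 / (n1 + n2)) * d."""
    t = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if t <= 0:
        return 1.0
    # sum over all integers j of (-1)^j exp(-2 j^2 t^2), using symmetry in j
    limit_cdf = 1 + 2 * sum((-1) ** j * math.exp(-2 * j * j * t * t)
                            for j in range(1, terms + 1))
    return 1 - limit_cdf

x, y = [1.0, 2.0, 3.0], [2.5, 3.5, 4.5]
d = two_sample_ks(x, y)
print(d, asymptotic_p_value(d, len(x), len(y)))
```

The samples here are far too small for the limit to be accurate; they are used only so that the value $D_{3,3} = 2/3$ can be checked by hand.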


6.5.3 Empirical likelihood ratio tests 


The method of likelihood ratio is useful in deriving tests under parametric models. In nonparametric problems, we now introduce a similar method based on the empirical likelihoods introduced in §5.1.2 and §5.1.4.

Suppose that a sample $X$ is from a population determined by a c.d.f. $F \in \mathcal{F}$, where $\mathcal{F}$ is a class of c.d.f.'s on $\mathcal{R}^d$. Consider the problem of testing

$$H_0: T(F) = t_0 \quad \text{versus} \quad H_1: T(F) \ne t_0, \tag{6.89}$$

where $T$ is a functional from $\mathcal{F}$ to $\mathcal{R}^k$ and $t_0$ is a fixed vector in $\mathcal{R}^k$. Let $\ell(G)$, $G \in \mathcal{F}$, be a given empirical likelihood, $\hat F$ be an MELE of $F$, and $\hat F_{H_0}$ be an MELE of $F$ under $H_0$, i.e., $\hat F_{H_0}$ is an MELE of $F$ subject to $T(F) = t_0$. Then the empirical likelihood ratio is defined as

$$\lambda_n(X) = \ell(\hat F_{H_0})/\ell(\hat F).$$

A test with rejection region $\lambda_n(X) < c$ is called an empirical likelihood ratio test.




As a specific example, consider the following empirical likelihood (or nonparametric likelihood) when $X = (X_1,...,X_n)$ with i.i.d. $X_i$'s:

$$\ell(G) = \prod_{i=1}^n p_i \quad \text{subject to} \quad p_i \ge 0,\ \sum_{i=1}^n p_i = 1,$$

where $p_i = P_G(\{x_i\})$, $i = 1,...,n$. Suppose that $T(G) = \int u(x)\,dG(x)$ with a known function $u(x)$ from $\mathcal{R}^d$ to $\mathcal{R}^r$. Then $\hat F = F_n$; $H_0$ in (6.89) with $t_0 = 0$ is the same as the case where assumption (5.9) holds; $\hat F_{H_0}$ is the MELE given by (5.11); and the empirical likelihood ratio is

$$\lambda_n(X) = n^n \prod_{i=1}^n \hat p_i, \tag{6.90}$$

where $\hat p_i$ is given by (5.12). An empirical likelihood ratio test with asymptotic significance level $\alpha$ can be obtained using the following result.


Theorem 6.11. Assume the conditions in Theorem 5.4. Under the hypothesis $H_0$ in (6.89) with $t_0 = 0$ (i.e., (5.9) holds),

$$-2\log\lambda_n \to_d \chi_r^2,$$

where $\lambda_n = \lambda_n(X)$ is given by (6.90) and $\chi_r^2$ has the chi-square distribution $\chi_r^2$.

The proof of this result can be found in Owen (1988, 1990). In fact, the result in Theorem 6.11 holds for some other functionals $T$ such as the median functional.
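To make (6.90) concrete, the sketch below computes $-2\log\lambda_n$ for a univariate mean, i.e., $d = r = 1$ and $u(x) = x - \mu_0$ with a hypothesized mean $\mu_0$. It assumes the usual Lagrange-multiplier form $\hat p_i = n^{-1}[1 + \xi u(x_i)]^{-1}$ with $\xi$ solving $\sum_i u(x_i)/[1 + \xi u(x_i)] = 0$, taken here to be what (5.12) states; the function name and the bisection solver are illustrative choices.

```python
import math

def el_minus2_log_ratio(x, mu0, iters=200):
    """-2 log lambda_n from (6.90) for H0: E(X_1) = mu0 (scalar data)."""
    u = [xi - mu0 for xi in x]
    if min(u) >= 0 or max(u) <= 0:
        # mu0 is not inside the range of the data: no weights satisfy the constraint
        return math.inf
    # All weights are positive iff 1 + xi*u_i > 0, i.e. xi in (-1/max u, -1/min u).
    eps = 1e-10
    lo = (-1 / max(u)) * (1 - eps)
    hi = (-1 / min(u)) * (1 - eps)
    g = lambda xi: sum(ui / (1 + xi * ui) for ui in u)   # strictly decreasing
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    xi = (lo + hi) / 2
    # lambda_n = prod 1/(1 + xi*u_i), so -2 log lambda_n = 2 sum log(1 + xi*u_i)
    return 2 * sum(math.log(1 + xi * ui) for ui in u)

data = [1.0, 2.0, 3.0]
print(el_minus2_log_ratio(data, 2.0), el_minus2_log_ratio(data, 2.5))
```

By Theorem 6.11 with $r = 1$, the test with asymptotic significance level $\alpha$ rejects $H_0$ when this quantity exceeds the $(1-\alpha)$th quantile of $\chi_1^2$ (about 3.84 for $\alpha = 0.05$).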

We can also derive tests based on the profile empirical likelihoods discussed in §5.4.1. Consider an empirical likelihood

$$\ell(G) = \prod_{i=1}^n p_i \quad \text{subject to} \quad p_i \ge 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i\psi(x_i, \theta) = 0,$$

where $\theta$ is a $k$-vector of unknown parameters and $\psi$ is a known function. Let $\theta = (\vartheta, \varphi)$, where $\vartheta$ is an $r$-vector and $\varphi$ is a $(k-r)$-vector. Suppose that we would like to test

$$H_0: \vartheta = \vartheta_0 \quad \text{versus} \quad H_1: \vartheta \ne \vartheta_0,$$

where $\vartheta_0$ is a fixed $r$-vector. Let $\hat\theta$ be a maximum of the profile empirical likelihood $\ell_P(\theta)$ given by (5.36) and let $\hat\varphi$ be a maximum of $\ell_P(\varphi) = \ell_P(\vartheta_0, \varphi)$. Then a profile empirical likelihood ratio test rejects $H_0$ when $\lambda_n(X) < c$, where

$$\lambda_n(X) = \prod_{i=1}^n \frac{1 + [\xi_n(\hat\theta)]^\tau\psi(x_i, \hat\theta)}{1 + [\zeta_n(\vartheta_0, \hat\varphi)]^\tau\psi(x_i, \vartheta_0, \hat\varphi)}, \tag{6.91}$$

$\hat\theta$ and $\hat\varphi$ are maximum profile empirical likelihood estimators, $\xi_n(\theta)$ satisfies

$$\sum_{i=1}^n \frac{\psi(x_i, \theta)}{1 + [\xi_n(\theta)]^\tau\psi(x_i, \theta)} = 0,$$

and $\zeta_n(\vartheta_0, \varphi)$ satisfies

$$\sum_{i=1}^n \frac{\psi(x_i, \vartheta_0, \varphi)}{1 + [\zeta_n(\vartheta_0, \varphi)]^\tau\psi(x_i, \vartheta_0, \varphi)} = 0.$$

From the discussion in §5.4.1, $\hat\theta$ is a solution of the GEE $\sum_{i=1}^n \psi(X_i, \theta) = 0$ when the dimension of $\psi$ is $k$. Under some regularity conditions (e.g., the conditions in Proposition 5.3), Qin and Lawless (1994) showed that the result in Theorem 6.11 holds with $\lambda_n(X)$ given by (6.91). Thus, a profile empirical likelihood ratio test with asymptotic significance level $\alpha$ can be obtained.


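Example 6.26. Let $Y_1,...,Y_n$ be i.i.d. random 2-vectors from $F$. Consider the problem of testing $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \ne \mu_2$, where $(\mu_1, \mu_2) = E(Y_1)$. Let $Y_i = (Y_{i1}, Y_{i2})$, $X_{i1} = Y_{i1} - Y_{i2}$, $X_{i2} = Y_{i1} + Y_{i2}$, and $X_i = (X_{i1}, X_{i2})$, $i = 1,...,n$. Then $X_1,...,X_n$ are i.i.d. with $E(X_1) = \theta = (\vartheta, \varphi)$, where $\vartheta = \mu_1 - \mu_2$ and $\varphi = \mu_1 + \mu_2$. The hypotheses of interest become $H_0: \vartheta = 0$ versus $H_1: \vartheta \ne 0$.

To apply the profile empirical likelihood method, we define $\psi(x, \theta) = x - \theta$, $x \in \mathcal{R}^2$. Note that a solution of the GEE $\sum_{i=1}^n (X_i - \theta) = 0$ is the sample mean $\hat\theta = \bar X$. The profile empirical likelihood ratio is then given by

$$\lambda_n(X) = \prod_{i=1}^n \frac{1 + [\xi_n(\bar X)]^\tau(X_i - \bar X)}{1 + [\zeta_n(0, \hat\varphi)]^\tau[X_i - (0, \hat\varphi)]},$$

where $\xi_n(\bar X)$, $\zeta_n(0, \varphi)$, and $\hat\varphi$ satisfy

$$\sum_{i=1}^n \frac{X_i - \bar X}{1 + [\xi_n(\bar X)]^\tau(X_i - \bar X)} = 0,$$

$$\sum_{i=1}^n \frac{X_i - (0, \varphi)}{1 + [\zeta_n(0, \varphi)]^\tau[X_i - (0, \varphi)]} = 0,$$

and $\ell_P(0, \hat\varphi) = \max_\varphi \ell_P(0, \varphi)$ with

$$\ell_P(0, \varphi) = \prod_{i=1}^n \frac{1}{n\{1 + [\zeta_n(0, \varphi)]^\tau[X_i - (0, \varphi)]\}}. \quad ∎$$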

Empirical likelihood ratio tests or profile empirical likelihood ratio tests 
in various other problems can be found, for example, in Owen (1988, 1990, 
2001), Chen and Qin (1993), Qin (1993), and Qin and Lawless (1994). 




6.5.4 Asymptotic tests 


We now introduce a simple method of constructing asymptotic tests (i.e., tests with asymptotic significance level $\alpha$). This method works for almost all problems (parametric or nonparametric) in which the hypotheses being tested are $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, where $\theta$ is a vector of parameters, and an asymptotically normally distributed estimator of $\theta$ can be found. However, this simple method may not provide the best or even nearly best solution to the problem, especially when there are different asymptotically normally distributed estimators of $\theta$.


Let $X$ be a sample of size $n$ from a population $P$ and $\hat\theta_n$ be an estimator of $\theta$, a $k$-vector of parameters related to $P$. Suppose that under $H_0$,

$$V_n^{-1/2}(\hat\theta_n - \theta) \to_d N_k(0, I_k), \tag{6.92}$$

where $V_n$ is the asymptotic covariance matrix of $\hat\theta_n$. If $V_n$ is known when $\theta = \theta_0$, then a test with rejection region

$$(\hat\theta_n - \theta_0)^\tau V_n^{-1}(\hat\theta_n - \theta_0) > \chi_{k,\alpha}^2 \tag{6.93}$$

has asymptotic significance level $\alpha$, where $\chi_{k,\alpha}^2$ is the $(1-\alpha)$th quantile of the chi-square distribution $\chi_k^2$. If the distribution of $\hat\theta_n$ does not depend on the unknown population $P$ under $H_0$ and (6.92) holds, then a test with rejection region (6.93) has limiting size $\alpha$.

If $V_n$ in (6.93) depends on the unknown population $P$ even if $H_0$ is true ($\theta = \theta_0$), then we have to replace $V_n$ in (6.93) by an estimator $\hat V_n$. If, under $H_0$, $\hat V_n$ is consistent according to Definition 5.4, then the test having rejection region (6.93) with $V_n$ replaced by $\hat V_n$ has asymptotic significance level $\alpha$. Variance estimation methods introduced in §5.5 can be used to construct a consistent estimator $\hat V_n$.
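As a small illustration of (6.93) with an estimated $V_n$, the sketch below takes $k = 1$, $\hat\theta_n = \bar X$, and $\hat V_n = S^2/n$, the choice discussed in Example 6.27. For $k = 1$ the quantile $\chi_{1,\alpha}^2$ equals $z_{1-\alpha/2}^2$, which avoids the need for a chi-square routine. The function name is hypothetical.

```python
from statistics import NormalDist, mean, variance

def asymptotic_mean_test(x, theta0=0.0, alpha=0.05):
    """Test (6.93) for H0: E(X_1) = theta0 with V_n replaced by S^2/n.

    Returns (statistic, critical value, reject?).
    """
    n = len(x)
    v_hat = variance(x) / n                           # S^2 / n
    stat = (mean(x) - theta0) ** 2 / v_hat            # quadratic form in (6.93)
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2   # (1 - alpha)th quantile of chi-square_1
    return stat, crit, stat > crit

print(asymptotic_mean_test([1.0, 2.0, 3.0, 4.0, 5.0], theta0=0.0))
```

For a $k$-vector $\theta$ the same steps apply with a matrix inverse and a $\chi_k^2$ quantile; this sketch keeps to the scalar case to stay within the standard library.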


In some cases result (6.92) holds for any $P$. Then, the following result shows that the test having rejection region (6.93) is asymptotically correct (§2.5.3), i.e., it is a consistent asymptotic test (Definition 2.13).


Theorem 6.12. Assume that (6.92) holds for any $P$ and that $\lambda_+[V_n] \to 0$, where $\lambda_+[V_n]$ is the largest eigenvalue of $V_n$.

(i) The test having rejection region (6.93) (with a known $V_n$ or $V_n$ replaced by an estimator $\hat V_n$ that is consistent for any $P$) is consistent.

(ii) If we choose $\alpha = \alpha_n \to 0$ as $n \to \infty$ and $\chi_{k,\alpha_n}^2\lambda_+[V_n] = o(1)$, then the test in (i) is Chernoff-consistent.

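Proof. The proof of (ii) is left as an exercise. We only prove (i) for the case where $V_n$ is known. Let $Z_n = V_n^{-1/2}(\hat\theta_n - \theta)$ and $l_n = V_n^{-1/2}(\theta - \theta_0)$. Then $\|Z_n\| = O_p(1)$ and $\|l_n\| = \|V_n^{-1/2}(\theta - \theta_0)\| \to \infty$ when $\theta \ne \theta_0$. The result follows from the fact that when $\theta \ne \theta_0$,

$$\begin{aligned}
(\hat\theta_n - \theta_0)^\tau V_n^{-1}(\hat\theta_n - \theta_0) &= \|Z_n\|^2 + \|l_n\|^2 + 2l_n^\tau Z_n \\
&\ge \|Z_n\|^2 + \|l_n\|^2 - 2\|l_n\|\|Z_n\| \\
&= O_p(1) + \|l_n\|^2[1 - o_p(1)]
\end{aligned}$$

and, therefore,

$$P\left((\hat\theta_n - \theta_0)^\tau V_n^{-1}(\hat\theta_n - \theta_0) > \chi_{k,\alpha}^2\right) \to 1. \quad ∎$$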


Example 6.27. Let $X_1,...,X_n$ be i.i.d. random variables from a symmetric c.d.f. $F$ having finite variance and positive $F'$. Consider the problem of testing $H_0: F$ is symmetric about 0 versus $H_1: F$ is not symmetric about 0. Under $H_0$, there are many estimators satisfying (6.92). We consider the following five estimators:

(1) $\hat\theta_n = \bar X$ and $\theta = E(X_1)$;

(2) $\hat\theta_n = \hat\theta_{0.5}$ (the sample median) and $\theta = F^{-1}(\frac{1}{2})$ (the median of $F$);

(3) $\hat\theta_n = \bar X_a$ (the $a$-trimmed sample mean defined by (5.77)) and $\theta = T(F)$, where $T$ is given by (5.46) with $J(t) = (1-2a)^{-1}I_{(a,1-a)}(t)$, $a \in (0, \frac{1}{2})$;

(4) $\hat\theta_n$ = the Hodges-Lehmann estimator (Example 5.8) and $\theta = F^{-1}(\frac{1}{2})$;

(5) $\hat\theta_n = W/n - \frac{1}{4}$, where $W$ is given by (6.83) with $J(t) = t$, and $\theta = T(F) - \frac{1}{4}$ with $T$ given by (5.53).

Although the $\theta$'s in (1)-(5) are different in general, in all cases $\theta = 0$ is equivalent to $H_0$.

For $\bar X$, it follows from the CLT that (6.92) holds with $V_n = \sigma^2/n$ for any $F$, where $\sigma^2 = \mathrm{Var}(X_1)$. From the SLLN, $S^2/n$ is a consistent estimator of $V_n$ for any $F$. Thus, the test having rejection region (6.93) with $\hat\theta_n = \bar X$ and $V_n$ replaced by $S^2/n$ is asymptotically correct. This test is asymptotically equivalent to the one-sample t-test derived in §6.2.3.

From Theorem 5.10, $\hat\theta_{0.5}$ satisfies (6.92) with $V_n = 4^{-1}[F'(\theta)]^{-2}n^{-1}$ for any $F$. A consistent estimator of $V_n$ can be obtained using the bootstrap method considered in §5.5.3. Another consistent estimator of $V_n$ can be obtained using Woodruff's interval introduced in §7.4 (see Exercise 86 in §7.6). The test having rejection region (6.93) with $\hat\theta_n = \hat\theta_{0.5}$ and $V_n$ replaced by a consistent estimator is asymptotically correct.

It follows from the discussion in §5.3.2 that $\bar X_a$ satisfies (6.92) for any $F$. A consistent estimator of $V_n$ can be obtained using formula (5.110) or the jackknife method in §5.5.2. The test having rejection region (6.93) with $\hat\theta_n = \bar X_a$ and $V_n$ replaced by a consistent estimator is asymptotically correct.

From Example 5.8, the Hodges-Lehmann estimator satisfies (6.92) for any $F$ and $V_n = 12^{-1}\gamma^{-2}n^{-1}$ under $H_0$, where $\gamma = \int F'(x)\,dF(x)$. A consistent estimator of $V_n$ under $H_0$ can be obtained using the result in Exercise 102 in §5.6. The test having rejection region (6.93) with $\hat\theta_n$ = the Hodges-Lehmann estimator and $V_n$ replaced by a consistent estimator is asymptotically correct.

Note that none of the tests discussed so far is of limiting size $\alpha$, since the distributions of $\hat\theta_n$ are still unknown under $H_0$.

The test having rejection region (6.93) with $\hat\theta_n = W/n - \frac{1}{4}$ and $V_n = (12n)^{-1}$ is equivalent to the one-sample Wilcoxon signed rank test and is shown to have limiting size $\alpha$ (§6.5.1). Also, (6.92) is satisfied for any $F$ (§5.2.2). Although Theorem 6.12 is not applicable, a modified proof of Theorem 6.12 can be used to show the consistency of this test (exercise).

It is not clear which one of the five tests discussed here is to be preferred in general.

The results for $\hat\theta_n$ in (1)-(3) and (5) still hold for testing $H_0: \theta = 0$ versus $H_1: \theta \ne 0$ without the assumption that $F$ is symmetric. ∎


An example of asymptotic tests for one-sided hypotheses is given in 
Exercise 123. Most tests in §6.1-§6.4 derived under parametric models are 
asymptotically correct even when the parametric model assumptions are 
removed. Some examples are given in Exercises 121-123. 


Finally, a study of asymptotic efficiencies of various tests can be found, 
for example, in Serfling (1980, Chapter 10). 


6.6 Exercises 


1. Prove Theorem 6.1 for the case of $\alpha = 0$ or 1.

2. Assume the conditions in Theorem 6.1. Let $\beta(P)$ be the power function of a UMP test of size $\alpha \in (0,1)$. Show that $\alpha < \beta(P_1)$ unless $P_0 = P_1$.

3. Let $T_*$ be given by (6.3) with $c = c(\alpha)$ for an $\alpha > 0$.
(a) Show that if $\alpha_1 < \alpha_2$, then $c(\alpha_1) \ge c(\alpha_2)$.
(b) Show that if $\alpha_1 < \alpha_2$, then the type II error probability of $T_*$ of size $\alpha_1$ is larger than that of $T_*$ of size $\alpha_2$.

4. Let $H_0$ and $H_1$ be simple and let $\alpha \in (0,1)$. Suppose that $T_*$ is a UMP test of size $\alpha$ for testing $H_0$ versus $H_1$ and that $\beta < 1$, where $\beta$ is the power of $T_*$ when $H_1$ is true. Show that $1 - T_*$ is a UMP test of size $1 - \beta$ for testing $H_1$ versus $H_0$.

5. Let $X$ be a sample of size 1 from a Lebesgue p.d.f. $f_\theta$. Find a UMP test of size $\alpha \in (0, \frac{1}{2})$ for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ when


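(a) $f_\theta(x) = 2\theta^{-2}(\theta - x)I_{(0,\theta)}(x)$, $\theta_0 < \theta_1$;
(b) $f_\theta(x) = 2[\theta x + (1-\theta)(1-x)]I_{(0,1)}(x)$, $0 \le \theta_1 < \theta_0 \le 1$;
(c) $f_{\theta_0}$ is the p.d.f. of $N(0,1)$ and $f_{\theta_1}$ is the p.d.f. of the Cauchy distribution $C(0,1)$;
(d) $f_{\theta_0}(x) = 4xI_{(0,\frac{1}{2})}(x) + 4(1-x)I_{(\frac{1}{2},1)}(x)$ and $f_{\theta_1}(x) = I_{(0,1)}(x)$;
(e) $f_\theta$ is the p.d.f. of the Cauchy distribution $C(\theta,1)$ and $\theta_i = i$;
(f) $f_{\theta_0}(x) = e^{-x}I_{(0,\infty)}(x)$ and $f_{\theta_1}(x) = 2^{-1}x^2e^{-x}I_{(0,\infty)}(x)$.

6. Let $X_1,...,X_n$ be i.i.d. from a Lebesgue p.d.f. $f_\theta$. Find a UMP test of size $\alpha$ for $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$ in the following cases:
(a) $f_\theta(x) = e^{-(x-\theta)}I_{(\theta,\infty)}(x)$, $\theta_0 < \theta_1$;
(b) $f_\theta(x) = \theta x^{-2}I_{(\theta,\infty)}(x)$, $\theta_0 \ne \theta_1$.

7. Prove Proposition 6.1.

8. Let $X \in \mathcal{R}^n$ be a sample with a p.d.f. $f$ w.r.t. a $\sigma$-finite measure $\nu$. Consider the problem of testing $H_0: f = f_\theta$ versus $H_1: f = g$, where $\theta \in \Theta$, $f_\theta(x)$ is Borel on $(\mathcal{R}^n \times \Theta, \sigma(\mathcal{B}^n \times \mathcal{F}))$, and $(\Theta, \mathcal{F}, \Lambda)$ is a probability space. Let $c > 0$ be a constant and

$$\phi_*(x) = \begin{cases} 1 & g(x) \ge c\int_\Theta f_\theta(x)\,d\Lambda \\ 0 & g(x) < c\int_\Theta f_\theta(x)\,d\Lambda. \end{cases}$$

Suppose that $\int \phi_*(x)f_\theta(x)\,d\nu = \sup_{\theta \in \Theta}\int \phi_*(x)f_\theta(x)\,d\nu = \alpha$ for any $\theta \in \Theta'$ with $\Lambda(\Theta') = 1$. Show that $\phi_*$ is a UMP test of size $\alpha$.

9. Let $f_0$ and $f_1$ be Lebesgue integrable functions on $\mathcal{R}$ and $\phi_*$ be the indicator function of the set $\{x: f_0(x) < 0\} \cup \{x: f_0(x) = 0, f_1(x) \ge 0\}$. Show that $\phi_*$ maximizes $\int \phi(x)f_1(x)\,dx$ over all Borel functions $\phi$ on $\mathcal{R}$ satisfying $0 \le \phi(x) \le 1$ and $\int \phi(x)f_0(x)\,dx = \int \phi_*(x)f_0(x)\,dx$.

10. Let $F_1$ and $F_2$ be two c.d.f.'s on $\mathcal{R}$. Show that $F_1(x) \le F_2(x)$ for all $x$ if and only if $\int g(x)\,dF_2(x) \le \int g(x)\,dF_1(x)$ for any nondecreasing function $g$.

11. Prove the claims in Example 6.5.

12. Show that the family $\{f_\theta: \theta \in \mathcal{R}\}$ has monotone likelihood ratio, where $f_\theta(x) = c(\theta)h(x)I_{(a(\theta),b(\theta))}(x)$, $h$ is a positive Lebesgue integrable function, and $a$ and $b$ are nondecreasing functions of $\theta$.

13. Prove part (iv) and part (v) of Theorem 6.2.

14. Let $X_1,...,X_n$ be i.i.d. from a Lebesgue p.d.f. $f_\theta$, $\theta \in \Theta \subset \mathcal{R}$. Find a UMP test of size $\alpha$ for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ when
(a) $f_\theta(x) = \theta^{-1}e^{-x/\theta}I_{(0,\infty)}(x)$, $\theta > 0$;
(b) $f_\theta(x) = \theta^{-1}x^{(1-\theta)/\theta}I_{(0,1)}(x)$, $\theta > 0$;
(c) $f_\theta(x)$ is the p.d.f. of $N(1, \theta)$;
(d) $f_\theta(x) = \theta^{-c}cx^{c-1}e^{-(x/\theta)^c}I_{(0,\infty)}(x)$, $\theta > 0$, where $c > 0$ is known.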


15. Suppose that the distribution of $X$ is in a family with monotone likelihood ratio in $Y(X)$, where $Y(X)$ has a continuous distribution. Consider the hypotheses $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$. Show that the p-value (§2.4.2) of the UMP test is given by $P_{\theta_0}(Y \ge y)$, where $y$ is the observed value of $Y$.

16. Let $X_1,...,X_m$ be i.i.d. from $N(\mu_x, \sigma_x^2)$ and $Y_1,...,Y_n$ be i.i.d. from $N(\mu_y, \sigma_y^2)$. Suppose that $X_i$'s and $Y_j$'s are independent.
(a) When $\sigma_x = \sigma_y = 1$, find a UMP test of size $\alpha$ for testing $H_0: \mu_x \le \mu_y$ versus $H_1: \mu_x > \mu_y$. (Hint: see Lehmann (1986, §3.9).)
(b) When $\mu_x$ and $\mu_y$ are known, find a UMP test of size $\alpha$ for testing $H_0: \sigma_x \le \sigma_y$ versus $H_1: \sigma_x > \sigma_y$. (Hint: see Lehmann (1986, §3.9).)

17. Let $F$ and $G$ be two known c.d.f.'s on $\mathcal{R}$ and $X$ be a single observation from the c.d.f. $\theta F(x) + (1-\theta)G(x)$, where $\theta \in [0,1]$ is unknown.
(a) Find a UMP test of size $\alpha$ for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$.
(b) Show that the test $T_*(X) \equiv \alpha$ is a UMP test of size $\alpha$ for testing $H_0: \theta \le \theta_1$ or $\theta \ge \theta_2$ versus $H_1: \theta_1 < \theta < \theta_2$.

18. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution $U(\theta, \theta+1)$, $\theta \in \mathcal{R}$. Suppose that $n \ge 2$.
(a) Find the joint distribution of $X_{(1)}$ and $X_{(n)}$.
(b) Show that a UMP test of size $\alpha$ for testing $H_0: \theta \le 0$ versus $H_1: \theta > 0$ is of the form

$$T_*(X_{(1)}, X_{(n)}) = \begin{cases} 0 & X_{(1)} < 1 - \alpha^{1/n},\ X_{(n)} < 1 \\ 1 & \text{otherwise}. \end{cases}$$

(c) Does the family of all possible distributions of $(X_{(1)}, X_{(n)})$ have monotone likelihood ratio? (Hint: see Lehmann (1986, p. 115).)

19. Suppose that $X_1,...,X_n$ are i.i.d. from the discrete uniform distribution $DU(1,...,\theta)$ (Table 1.1, page 18) with an unknown $\theta = 1, 2, ....$
(a) Consider $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$. Show that

$$T_*(X) = \begin{cases} 1 & X_{(n)} > \theta_0 \\ \alpha & X_{(n)} \le \theta_0 \end{cases}$$

is a UMP test of size $\alpha$.
(b) Consider $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$. Show that

$$T_*(X) = \begin{cases} 1 & X_{(n)} > \theta_0 \text{ or } X_{(n)} \le \theta_0\alpha^{1/n} \\ 0 & \text{otherwise} \end{cases}$$

is a UMP test of size $\alpha$.
(c) Show that the results in (a) and (b) still hold if the discrete uniform distribution is replaced by the uniform distribution $U(0, \theta)$, $\theta > 0$.


20. Let $X_1,...,X_n$ be i.i.d. from the exponential distribution $E(a, \theta)$, $a \in \mathcal{R}$, $\theta > 0$.
(a) Derive a UMP test of size $\alpha$ for testing $H_0: a = a_0$ versus $H_1: a \ne a_0$, when $\theta$ is known.
(b) For testing $H_0: a = a_0$ versus $H_1: a = a_1 < a_0$, show that any UMP test $T_*$ of size $\alpha$ satisfies $\beta_{T_*}(a_1) = 1 - (1-\alpha)e^{-n(a_0-a_1)/\theta}$.
(c) For testing $H_0: a = a_0$ versus $H_1: a = a_1 < a_0$, show that the power of any size $\alpha$ test that rejects $H_0$ when $Y \le c_1$ or $Y \ge c_2$ is the same as that in part (b), where $Y = (X_{(1)} - a_0)/\sum_{i=1}^n (X_i - X_{(1)})$.
(d) Derive a UMP test of size $\alpha$ for testing $H_0: a = a_0$ versus $H_1: a \ne a_0$.
(e) Derive a UMP test of size $\alpha$ for testing $H_0: \theta = \theta_0$, $a = a_0$ versus $H_1: \theta < \theta_0$, $a < a_0$.

21. Let $X_1,...,X_n$ be i.i.d. from the Pareto distribution $Pa(a, \theta)$, $\theta > 0$, $a > 0$.
(a) Derive a UMP test of size $\alpha$ for testing $H_0: a = a_0$ versus $H_1: a \ne a_0$ when $\theta$ is known.
(b) Derive a UMP test of size $\alpha$ for testing $H_0: a = a_0$, $\theta = \theta_0$ versus $H_1: \theta > \theta_0$, $a < a_0$.

22. In Exercise 19(a) of §3.6, derive a UMP test of size $\alpha \in (0,1)$ for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, where $\theta_0$ is known and $\theta_0 > (1-\alpha)^{-1/n}$.

23. In Exercise 55 of §2.6, derive a UMP test of size $\alpha$ for testing $H_0: \theta \ge \theta_0$ versus $H_1: \theta < \theta_0$ based on data $X_1,...,X_n$, where $\theta_0 > 0$ is a fixed value.

24. Prove part (ii) of Theorem 6.3.

25. Consider Example 6.10. Suppose that $\theta_2 = -\theta_1$. Show that $c_2 = -c_1$ and discuss how to find the value of $c_2$.

26. Suppose that the distribution of $X$ is in a family of p.d.f.'s indexed by a real-valued parameter $\theta$; there is a real-valued sufficient statistic $U(X)$ such that $f_{\theta_2}(u)/f_{\theta_1}(u)$ is strictly increasing in $u$ for $\theta_1 < \theta_2$, where $f_\theta(u)$ is the Lebesgue p.d.f. of $U(X)$ and is continuous in $u$ for each $\theta$; and that for all $\theta_1 < \theta_2 < \theta_3$ and $u_1 < u_2 < u_3$,

$$\begin{vmatrix} f_{\theta_1}(u_1) & f_{\theta_1}(u_2) & f_{\theta_1}(u_3) \\ f_{\theta_2}(u_1) & f_{\theta_2}(u_2) & f_{\theta_2}(u_3) \\ f_{\theta_3}(u_1) & f_{\theta_3}(u_2) & f_{\theta_3}(u_3) \end{vmatrix} > 0.$$

Show that the conclusions of Theorem 6.3 remain valid.


27. (p-values). Suppose that $X$ has a distribution $P_\theta$, where $\theta \in \mathcal{R}$ is unknown. Consider a family of nonrandomized level $\alpha$ tests for $H_0: \theta = \theta_0$ (or $\theta \le \theta_0$) with rejection region $C_\alpha$ such that $P_{\theta_0}(X \in C_\alpha) = \alpha$ for all $0 < \alpha < 1$ and $C_{\alpha_1} = \cap_{\alpha > \alpha_1} C_\alpha$ for all $0 < \alpha_1 < 1$.
(a) Show that the p-value is $\hat\alpha(x) = \inf\{\alpha: x \in C_\alpha\}$.
(b) Show that when $\theta = \theta_0$, $\hat\alpha(X)$ has the uniform distribution $U(0,1)$.
(c) If the tests with rejection regions $C_\alpha$ are unbiased of level $\alpha$, show that under $H_1$, $P_\theta(\hat\alpha(X) \le \alpha) \ge \alpha$.

28. Suppose that $X$ has the p.d.f. (6.10). Consider hypotheses (6.13) or (6.14). Show that a UMP test does not exist. (Hint: this follows from a consideration of the UMP tests for the one-sided hypotheses $H_0: \theta \ge \theta_1$ and $H_0: \theta \le \theta_2$.)

29. Consider Exercise 17 with $H_0: \theta \in [\theta_1, \theta_2]$ versus $H_1: \theta \notin [\theta_1, \theta_2]$, where $0 < \theta_1 \le \theta_2 < 1$.
(a) Show that a UMP test does not exist.
(b) Obtain a UMPU test of size $\alpha$.

30. In the proof of Theorem 6.4, show that
(a) (6.30) is equivalent to (6.31);
(b) (6.31) is equivalent to (6.29) with $T_*$ replaced by $T$;
(c) when $0 < \alpha < 1$, $(\alpha, \alpha E_{\theta_0}(Y))$ is an interior point of the set of points $(E_{\theta_0}(T), E_{\theta_0}(TY))$ as $T$ ranges over all tests of the form $T = T(Y)$;
(d) the UMPU tests are unique a.s. $\mathcal{P}$ if attention is restricted to tests depending on $(Y, U)$ and $(Y, U)$ has a continuous c.d.f.

31. Consider the decision problem in Example 2.20 with the 0-1 loss. Show that if a UMPU test of size $\alpha$ exists and is unique (in the sense that decision rules that are equivalent in terms of the risk are treated the same), then it is admissible.

32. Let $X_1,...,X_n$ be i.i.d. binary random variables with $p = P(X_i = 1)$.
(a) Determine the $c_i$'s and $\gamma_i$'s in (6.15) and (6.16) for testing $H_0: p \le 0.2$ or $p \ge 0.7$ when $\alpha = 0.1$ and $n = 15$. Find the power of the UMP test (6.15) when $p = 0.4$.
(b) Derive a UMPU test of size $\alpha$ for $H_0: p = p_0$ versus $H_1: p \ne p_0$ when $n = 10$, $\alpha = 0.05$, and $p_0 = 0.4$.

33. Suppose that $X$ has the Poisson distribution $P(\theta)$ with an unknown $\theta > 0$. Show that (6.29) reduces to

$$\sum_{x=c_1+1}^{c_2-1} \frac{\theta_0^{x-1}e^{-\theta_0}}{(x-1)!} + \sum_{i=1}^2 (1-\gamma_i)\frac{\theta_0^{c_i-1}e^{-\theta_0}}{(c_i-1)!} = 1 - \alpha,$$

provided that $c_1 > 1$.


34. Let $X$ be a random variable from the geometric distribution $G(p)$. Find a UMPU test of size $\alpha$ for $H_0: p = p_0$ versus $H_1: p \ne p_0$.

35. In Exercise 33 of §2.6, derive a UMPU test of size $\alpha \in (0,1)$ for testing $H_0: p \le \frac{1}{2}$ versus $H_1: p > \frac{1}{2}$.

36. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma^2$.
(a) Show how the power of the one-sample t-test depends on a noncentral t-distribution.
(b) Show that the power of the one-sample t-test is an increasing function of $(\mu - \mu_0)/\sigma$ for testing $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$, and of $|\mu - \mu_0|/\sigma$ for testing $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$.

37. Let $X_1,...,X_n$ be i.i.d. from the gamma distribution $\Gamma(\theta, \gamma)$ with unknown $\theta$ and $\gamma$.
(a) For testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ and $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, show that there exist UMPU tests whose rejection regions are based on $V = \prod_{i=1}^n (X_i/\bar X)$.
(b) For testing $H_0: \gamma \le \gamma_0$ versus $H_1: \gamma > \gamma_0$, show that a UMPU test rejects $H_0$ when $\sum_{i=1}^n X_i > C(\prod_{i=1}^n X_i)$ for some function $C$.

38. Let $X_1$ and $X_2$ be independently distributed as the Poisson distributions $P(\lambda_1)$ and $P(\lambda_2)$, respectively.
(a) Find a UMPU test of size $\alpha$ for testing $H_0: \lambda_1 \ge \lambda_2$ versus $H_1: \lambda_1 < \lambda_2$.
(b) Calculate the power of the UMPU test in (a) when $\alpha = 0.1$, $(\lambda_1, \lambda_2) = (0.1, 0.2)$, $(1, 2)$, $(10, 20)$, and $(0.1, 0.4)$.

39. Consider the binomial problem in Example 6.11.
(a) Prove the claim about $P(Y = y|U = u)$.
(b) Find a UMPU test of size $\alpha$ for testing $H_0: p_1 \ge p_2$ versus $H_1: p_1 < p_2$.
(c) Repeat (b) for $H_0: p_1 = p_2$ versus $H_1: p_1 \ne p_2$.

40. Let $X_1$ and $X_2$ be independently distributed as the negative binomial distributions $NB(p_1, n_1)$ and $NB(p_2, n_2)$, respectively, where $n_i$'s are known and $p_i$'s are unknown.
(a) Show that there exists a UMPU test of size $\alpha$ for testing $H_0: p_1 \le p_2$ versus $H_1: p_1 > p_2$.
(b) Determine the conditional distribution $P_{Y|U=u}$ in Theorem 6.4 when $n_1 = n_2 = 1$.

41. Let $(X_0, X_1, X_2)$ be a random vector having a multinomial distribution (Example 2.7) with $k = 2$, $p_0 = 1 - p_1 - p_2$, and unknown $p_1 \in (0,1)$ and $p_2 \in (0,1)$. Derive a UMPU test of size $\alpha$ for testing $H_0: p_0 = p^2$, $p_1 = 2p(1-p)$, $p_2 = (1-p)^2$ versus $H_1: H_0$ is not true, where $p \in (0,1)$ is unknown.


42. Consider Example 6.12.
(a) Show that $A$ and $B$ are independent if and only if $\log\frac{p_{11}}{p_{22}} = \log\frac{p_{12}}{p_{22}} + \log\frac{p_{21}}{p_{22}}$.
(b) Derive a UMPU test of size $\alpha$ for testing $H_0: P(A) = P(B)$ versus $H_1: P(A) \ne P(B)$.

43. Let $X_1$ and $X_2$ be independently distributed according to p.d.f.'s given by (6.10) with $\xi$, $\eta$, $\theta$, $Y$, and $h$ replaced by $\xi_i$, $\eta_i$, $\theta_i$, $Y_i$, and $h_i$, $i = 1,2$, respectively. Show that there exists a UMPU test of size $\alpha$ for testing
(a) $H_0: \eta_2(\theta_2) - \eta_1(\theta_1) \le \eta_0$ versus $H_1: \eta_2(\theta_2) - \eta_1(\theta_1) > \eta_0$;
(b) $H_0: \eta_2(\theta_2) + \eta_1(\theta_1) \le \eta_0$ versus $H_1: \eta_2(\theta_2) + \eta_1(\theta_1) > \eta_0$.

44. Let $X_j$, $j = 1,2,3$, be independent from the Poisson distributions $P(\lambda_j)$, $j = 1,2,3$, respectively. Show that there exists a UMPU test of size $\alpha$ for testing $H_0: \lambda_1\lambda_2 \le \lambda_3^2$ versus $H_1: \lambda_1\lambda_2 > \lambda_3^2$.

45. Let $X_{ij}$, $i = 1,2$, $j = 1,2$, be independent from the Poisson distributions $P(\lambda_ip_{ij})$, where $\lambda_i > 0$, $0 < p_{ij} < 1$, and $p_{i1} + p_{i2} = 1$. Derive a UMPU test of size $\alpha$ for testing $H_0: p_{11} \le p_{21}$ versus $H_1: p_{11} > p_{21}$.

46. Let $X_{ij}$ be independent random variables satisfying $P(X_{ij} = 0) = \theta_i$, $P(X_{ij} = k) = (1-\theta_i)(1-p_i)^{k-1}p_i$, $k = 1, 2, ...$, where $0 < \theta_i < 1$ and $0 < p_i < 1$, $j = 1,...,n_i$ and $i = 1,2$. Derive a UMPU test of size $\alpha$ for testing $H_0: p_1 \le p_2$ versus $H_1: p_1 > p_2$.

47. Let $X_{11},...,X_{1n_1}$ and $X_{21},...,X_{2n_2}$ be two independent samples i.i.d. from the gamma distributions $\Gamma(\theta_1, \gamma_1)$ and $\Gamma(\theta_2, \gamma_2)$, respectively.
(a) Assume that $\theta_1$ and $\theta_2$ are known. For testing $H_0: \gamma_1 \le \gamma_2$ versus $H_1: \gamma_1 > \gamma_2$ and $H_0: \gamma_1 = \gamma_2$ versus $H_1: \gamma_1 \ne \gamma_2$, show that there exist UMPU tests and that the rejection regions can be determined by using beta distributions.
(b) If $\theta_i$'s are unknown in (a), show that there exist UMPU tests and describe their general forms.
(c) Assume that $\gamma_1 = \gamma_2$ (unknown). For testing $H_0: \theta_1 \le \theta_2$ versus $H_1: \theta_1 > \theta_2$ and $H_0: \theta_1 = \theta_2$ versus $H_1: \theta_1 \ne \theta_2$, show that there exist UMPU tests and describe their general forms.

48. Let $N$ be a random variable with the following discrete p.d.f.:

$$P(N = n) = C(\lambda)a(n)\lambda^nI_{\{0,1,2,...\}}(n),$$

where $\lambda > 0$ is unknown and $a$ and $C$ are known functions. Suppose that given $N = n$, $X_1,...,X_n$ are i.i.d. from the p.d.f. given in (6.10). Show that, based on $(N, X_1,...,X_N)$, there exists a UMPU test of size $\alpha$ for $H_0: \eta(\theta) \le \eta_0$ versus $H_1: \eta(\theta) > \eta_0$.




49. Let $X_{i1},\ldots,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from $N(\mu_i,\sigma^2)$, respectively, $n_i \ge 2$. Show that a UMPU test of size $\alpha$ for $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ rejects $H_0$ when $|t(X)| > t_{n_1+n_2-2,\alpha/2}$, where $t(X)$ is given by (6.37) and $t_{n_1+n_2-2,\alpha}$ is the $(1-\alpha)$th quantile of the t-distribution $t_{n_1+n_2-2}$. Derive the power function of this test.


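50. In the two-sample problem discussed in §6.2.3, show that when $n_1 = n_2$, a UMPU test of size $\alpha$ for testing $H_0: \sigma_2^2 = \Delta_0\sigma_1^2$ versus $H_1: \sigma_2^2 \neq \Delta_0\sigma_1^2$ rejects $H_0$ when
$$\max\left\{\frac{S_2^2}{\Delta_0 S_1^2}, \frac{\Delta_0 S_1^2}{S_2^2}\right\} > \frac{1-c}{c},$$
where $\int_0^c f_{(n_1-1)/2,(n_1-1)/2}(v)\,dv = \alpha/2$ and $f_{a,b}$ is the p.d.f. of the beta distribution $B(a,b)$.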


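51. Suppose that $X_i = \beta_0 + \beta_1 t_i + \varepsilon_i$, where $t_i$'s are fixed constants that are not all the same, $\varepsilon_i$'s are i.i.d. from $N(0,\sigma^2)$, and $\beta_0$, $\beta_1$, and $\sigma^2$ are unknown parameters. Derive a UMPU test of size $\alpha$ for testing
(a) $H_0: \beta_0 \le \theta_0$ versus $H_1: \beta_0 > \theta_0$;
(b) $H_0: \beta_0 = \theta_0$ versus $H_1: \beta_0 \neq \theta_0$;
(c) $H_0: \beta_1 \le \theta_0$ versus $H_1: \beta_1 > \theta_0$;
(d) $H_0: \beta_1 = \theta_0$ versus $H_1: \beta_1 \neq \theta_0$.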


52. In the previous exercise, derive the power function in each of (a)-(d) in terms of a noncentral t-distribution.


53. Consider the normal linear model in §6.2.3 (i.e., model (3.25) with $\varepsilon = N_n(0,\sigma^2 I_n)$). For testing $H_0: \sigma^2 \le \sigma_0^2$ versus $H_1: \sigma^2 > \sigma_0^2$ and $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 \neq \sigma_0^2$, show that UMPU tests of size $\alpha$ are functions of SSR and their rejection regions can be determined using chi-square distributions.


54. In the problem of testing for independence in the bivariate normal family, show that
(a) the p.d.f. in (6.44) is of the form (6.23) and identify $\varphi$;
(b) the sample correlation coefficient $R$ is independent of $U$ when $\rho = 0$;
(c) $R$ is linear in $Y$, and $V$ in (6.45) has the t-distribution $t_{n-2}$ when $\rho = 0$.


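55. Let $X_1,\ldots,X_n$ be i.i.d. bivariate normal with the p.d.f. in (6.44) and let $S_j^2 = \sum_{i=1}^n (X_{ij} - \bar{X}_j)^2$ and $S_{12} = \sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)$.
(a) Show that a UMPU test for testing $H_0: \sigma_2/\sigma_1 = \Delta_0$ versus $H_1: \sigma_2/\sigma_1 \neq \Delta_0$ rejects $H_0$ when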




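$$R = \frac{|\Delta_0^2 S_1^2 - S_2^2|}{\sqrt{(\Delta_0^2 S_1^2 + S_2^2)^2 - 4\Delta_0^2 S_{12}^2}} > c.$$
(b) Find the p.d.f. of $R$ in (a) when $\sigma_2/\sigma_1 = \Delta_0$.
(c) Assume that $\sigma_1 = \sigma_2$. Show that a UMPU test for $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ rejects $H_0$ when
$$V = \frac{|\bar{X}_2 - \bar{X}_1|}{\sqrt{S_1^2 + S_2^2 - 2S_{12}}} > c.$$
(d) Find the p.d.f. of $V$ in (c) when $\mu_1 = \mu_2$.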


56. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. random 2-vectors having the bivariate normal distribution with $EX_1 = EY_1 = 0$, $\mathrm{Var}(X_1) = \sigma_x^2$, $\mathrm{Var}(Y_1) = \sigma_y^2$, and $\mathrm{Cov}(X_1,Y_1) = \rho\sigma_x\sigma_y$, where $\sigma_x > 0$, $\sigma_y > 0$, and $\rho \in [0,1)$ are unknown. Derive the form and exact distribution of a UMPU test of size $\alpha$ for testing $H_0: \rho = 0$ versus $H_1: \rho > 0$.


57. Let $X_1,\ldots,X_n$ be i.i.d. from the exponential distribution $E(a,\theta)$ with unknown $a$ and $\theta$. Let $V = 2\sum_{i=1}^n (X_i - X_{(1)})$, where $X_{(1)}$ is the smallest order statistic.
(a) For testing $H_0: \theta = 1$ versus $H_1: \theta \neq 1$, show that a UMPU test of size $\alpha$ rejects $H_0$ when $V < c_1$ or $V > c_2$, where $c_i$'s are determined by
$$\int_{c_1}^{c_2} f_{2n-2}(v)\,dv = \int_{c_1}^{c_2} f_{2n}(v)\,dv = 1 - \alpha,$$
and $f_m(v)$ is the p.d.f. of the chi-square distribution $\chi_m^2$.
(b) For testing $H_0: a = 0$ versus $H_1: a \neq 0$, show that a UMPU test of size $\alpha$ rejects $H_0$ when $X_{(1)} < 0$ or $2nX_{(1)}/V > c$, where $c$ is determined by
$$(n-1)\int_0^c (1+v)^{-n}\,dv = 1 - \alpha.$$


58. Let $X_1,\ldots,X_n$ be i.i.d. random variables from the uniform distribution $U(\theta,\vartheta)$, $-\infty < \theta < \vartheta < \infty$.
(a) Show that the conditional distribution of $X_{(1)}$ given $X_{(n)} = x$ is the distribution of the minimum of a sample of size $n-1$ from the uniform distribution $U(\theta,x)$.
(b) Find a UMPU test of size $\alpha$ for testing $H_0: \theta \le 0$ versus $H_1: \theta > 0$.


59. Let $X_1,\ldots,X_n$ be independent random variables having the binomial distributions $Bi(p_i,k_i)$, $i = 1,\ldots,n$, respectively, where $p_i = e^{a+bt_i}/(1+e^{a+bt_i})$, $(a,b) \in \mathcal{R}^2$ is unknown, and $t_i$'s are known covariate values that are not all the same. Derive the UMPU test of size $\alpha$ for testing (a) $H_0: a \ge 0$ versus $H_1: a < 0$; (b) $H_0: b \ge 0$ versus $H_1: b < 0$.




60. In the previous exercise, derive approximations to the UMPU tests by considering the limiting distributions of the test statistics.


61. Let $\mathcal{X} = \{x \in \mathcal{R}^n : \text{all components of } x \text{ are nonzero}\}$ and $\mathcal{G}$ be the group of transformations $g(x) = (cx_1,\ldots,cx_n)$, $c > 0$. Show that a maximal invariant under $\mathcal{G}$ is $(\mathrm{sgn}(x_n), x_1/x_n, \ldots, x_{n-1}/x_n)$, where $\mathrm{sgn}(x)$ is 1 or $-1$ as $x$ is positive or negative.


62. Let $X_1,\ldots,X_n$ be i.i.d. with a Lebesgue p.d.f. $\sigma^{-1}f(x/\sigma)$ and $f_i$, $i = 0,1$, be two known Lebesgue p.d.f.'s on $\mathcal{R}$ that are either 0 for $x < 0$ or symmetric about 0. Consider $H_0: f = f_0$ versus $H_1: f = f_1$ and $\mathcal{G} = \{g_r : r > 0\}$ with $g_r(x) = rx$.
(a) Show that a UMPI test rejects $H_0$ when
$$\frac{\int_0^\infty v^{n-1} f_1(vX_1)\cdots f_1(vX_n)\,dv}{\int_0^\infty v^{n-1} f_0(vX_1)\cdots f_0(vX_n)\,dv} > c.$$
(b) Show that if $f_0 = N(0,1)$ and $f_1(x) = e^{-|x|}/2$, then the UMPI test in (a) rejects $H_0$ when $(\sum_{i=1}^n X_i^2)^{1/2}/\sum_{i=1}^n |X_i| > c$.
(c) Show that if $f_0(x) = I_{(0,1)}(x)$ and $f_1(x) = 2xI_{(0,1)}(x)$, then the UMPI test in (a) rejects $H_0$ when $X_{(n)}/(\prod_{i=1}^n X_i)^{1/n} < c$.
(d) Find the value of $c$ in part (c) when the UMPI test is of size $\alpha$.


63. Consider the location-scale family problem (with unknown parameters $\mu$ and $\sigma$) in Example 6.13.
(a) Show that $W$ is maximal invariant under the given $\mathcal{G}$.
(b) Show that Proposition 6.2 applies and find the form of the functional $\theta(f_{i,\mu,\sigma})$.
(c) Derive the p.d.f. of $W(X)$ under $H_i$, $i = 0,1$.
(d) Obtain a UMPI test.


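64. In Example 6.13, find the rejection region of the UMPI test when $X_1,\ldots,X_n$ are i.i.d. and
(a) $f_{0,\mu,\sigma}$ is $N(\mu,\sigma^2)$ and $f_{1,\mu,\sigma}$ is the p.d.f. of the uniform distribution $U(\mu - \frac{1}{2}\sigma, \mu + \frac{1}{2}\sigma)$;
(b) $f_{0,\mu,\sigma}$ is $N(\mu,\sigma^2)$ and $f_{1,\mu,\sigma}$ is the p.d.f. of the exponential distribution $E(\mu,\sigma)$;
(c) $f_{0,\mu,\sigma}$ is the p.d.f. of $U(\mu - \frac{1}{2}\sigma, \mu + \frac{1}{2}\sigma)$ and $f_{1,\mu,\sigma}$ is the p.d.f. of $E(\mu,\sigma)$;
(d) $f_{0,\mu}$ is $N(\mu,1)$ and $f_{1,\mu}(x) = \exp\{-e^{x-\mu} + x - \mu\}$.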


65. Prove the claims in Example 6.15.


66. Let $X_1,\ldots,X_n$ be i.i.d. from $N(\mu,\sigma^2)$ with unknown $\mu$ and $\sigma^2$. Consider the problem of testing $H_0: \mu = 0$ versus $H_1: \mu \neq 0$ and the group of transformations $g_c(X_i) = cX_i$, $c \neq 0$.




(a) Show that the testing problem is invariant under G. 
(b) Show that the one-sample two-sided t-test in §6.2.3 is a UMPI 
test. 


67. Prove the claims in Example 6.16.


68. Consider Example 6.16 with $H_0$ and $H_1$ replaced by $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$, and with $\mathcal{G}$ changed to $\{g_{c_1,c_2,r} : c_1 = c_2 \in \mathcal{R}, r \neq 0\}$.
(a) Show that the testing problem is invariant under $\mathcal{G}$.
(b) Show that the two-sample two-sided t-test in §6.2.3 is a UMPI test.


69. Show that the UMPU tests in Exercise 37(a) and Exercise 47(a) are also UMPI tests under $\mathcal{G} = \{g_r : r > 0\}$ with $g_r(x) = rx$.


70. In Example 6.17, show that $t(X)$ has the noncentral t-distribution $t_{n-1}(\sqrt{n}\theta)$; the family $\{f_\theta(t) : \theta \in \mathcal{R}\}$ has monotone likelihood ratio in $t$; and that for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$, a test that is UMP among all level $\alpha$ unbiased tests based on $t(X)$ rejects $H_0$ when $t(X) < c_1$ or $t(X) > c_2$. (Hint: consider Exercise 26.)


71. Let $X_1$ and $X_2$ be independently distributed as the exponential distributions $E(0,\theta_i)$, $i = 1,2$, respectively. Define $\theta = \theta_1/\theta_2$.
(a) For testing $H_0: \theta \le 1$ versus $H_1: \theta > 1$, show that the problem is invariant under the group of transformations $g_c(x_1,x_2) = (cx_1,cx_2)$, $c > 0$, and that a UMPI test of size $\alpha$ rejects $H_0$ when $X_2/X_1 > (1-\alpha)/\alpha$.
(b) For testing $H_0: \theta = 1$ versus $H_1: \theta \neq 1$, show that the problem is invariant under the group of transformations in (a) and $g(x_1,x_2) = (x_2,x_1)$, and that a UMPI test of size $\alpha$ rejects $H_0$ when $X_1/X_2 > (2-\alpha)/\alpha$ or $X_2/X_1 > (2-\alpha)/\alpha$.


72. Let $X_1,\ldots,X_m$ and $Y_1,\ldots,Y_n$ be two independent samples i.i.d. from the exponential distributions $E(a_1,\theta_1)$ and $E(a_2,\theta_2)$, respectively. Let $g_{r,c,d}(x,y) = (rx_1 + c,\ldots,rx_m + c, ry_1 + d,\ldots,ry_n + d)$ and let $\mathcal{G} = \{g_{r,c,d} : r > 0, c \in \mathcal{R}, d \in \mathcal{R}\}$.
(a) Show that a UMPI test of size $\alpha$ for testing $H_0: \theta_1/\theta_2 \ge \Delta_0$ versus $H_1: \theta_1/\theta_2 < \Delta_0$ rejects $H_0$ when $\sum_{i=1}^n (Y_i - Y_{(1)}) > c\sum_{i=1}^m (X_i - X_{(1)})$ for some constant $c$.
(b) Find the value of $c$ in (a).
(c) Show that the UMPI test in (a) is also a UMPU test.


73. Let $M(U)$ be given by (6.51) and $W = M(U)(n-r)/s$.
(a) Show that $W$ has the noncentral F-distribution $F_{s,n-r}(\theta)$.
(b) Show that $f_{\theta_1}(w)/f_0(w)$ is an increasing function of $w$ for any given $\theta_1 > 0$.




74. Consider normal linear model (6.38). Show that
(a) the UMPI test derived in §6.3.2 for testing (6.49) is the same as the UMPU test for (6.40) given in §6.2.3 when $s = 1$ and $\theta_0 = 0$;
(b) the test with the rejection region $W > F_{s,n-r,\alpha}$ is a UMPI test of size $\alpha$ for testing $H_0: L\beta = \theta_0$ versus $H_1: L\beta \neq \theta_0$, where $W$ is given by (6.52), $\theta_0$ is a fixed constant, $L$ is the same as that in (6.49), and $F_{s,n-r,\alpha}$ is the $(1-\alpha)$th quantile of the F-distribution $F_{s,n-r}$.


75. In Examples 6.18-6.19,
(a) prove the claim in Example 6.19;
(b) derive the distribution of $W$ by applying Cochran's theorem.


76. (Two-way additive model). Assume that $X_{ij}$'s are independent and
$$X_{ij} = N(\mu_{ij},\sigma^2), \quad i = 1,\ldots,a, \; j = 1,\ldots,b,$$
where $\mu_{ij} = \mu + \alpha_i + \beta_j$ and $\sum_{i=1}^a \alpha_i = \sum_{j=1}^b \beta_j = 0$. Derive the forms of the UMPI tests in §6.3.2 for testing (6.54) and (6.55).


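77. (Three-way additive model). Assume that $X_{ijk}$'s are independent and
$$X_{ijk} = N(\mu_{ijk},\sigma^2), \quad i = 1,\ldots,a, \; j = 1,\ldots,b, \; k = 1,\ldots,c,$$
where $\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k$ and $\sum_{i=1}^a \alpha_i = \sum_{j=1}^b \beta_j = \sum_{k=1}^c \gamma_k = 0$. Derive the UMPI test based on the $W$ in (6.52) for testing $H_0: \alpha_i = 0$ for all $i$ versus $H_1: \alpha_i \neq 0$ for some $i$.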


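78. Let $X_1,\ldots,X_m$ and $Y_1,\ldots,Y_n$ be independently normally distributed with a common unknown variance $\sigma^2$ and means
$$E(X_i) = \mu_x + \beta_x(u_i - \bar{u}), \qquad E(Y_j) = \mu_y + \beta_y(v_j - \bar{v}),$$
where $u_i$'s and $v_j$'s are known constants, $\bar{u} = m^{-1}\sum_{i=1}^m u_i$, $\bar{v} = n^{-1}\sum_{j=1}^n v_j$, and $\mu_x$, $\mu_y$, $\beta_x$, and $\beta_y$ are unknown. Derive the UMPI test based on the $W$ in (6.52) for testing
(a) $H_0: \beta_x = \beta_y$ versus $H_1: \beta_x \neq \beta_y$;
(b) $H_0: \beta_x = \beta_y$ and $\mu_x = \mu_y$ versus $H_1: \beta_x \neq \beta_y$ or $\mu_x \neq \mu_y$.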


79. Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be i.i.d. from a bivariate normal distribution with unknown means, variances, and correlation coefficient $\rho$.
(a) Show that the problem of testing $H_0: \rho \le \rho_0$ versus $H_1: \rho > \rho_0$ is invariant under $\mathcal{G}$ containing transformations $rX_i + c$, $sY_i + d$, $i = 1,\ldots,n$, where $r > 0$, $s > 0$, $c \in \mathcal{R}$, and $d \in \mathcal{R}$. Show that a UMPI test rejects $H_0$ when $R > c$, where $R$ is the sample correlation coefficient given in (6.45). (Hint: see Lehmann (1986, p. 340).)
(b) Show that the problem of testing $H_0: \rho = 0$ versus $H_1: \rho \neq 0$ is invariant in addition (to the transformations in (a)) under the transformation $g(X_i,Y_i) = (X_i,-Y_i)$, $i = 1,\ldots,n$. Show that a UMPI test rejects $H_0$ when $|R| > c$.




80. Under the random effects model (6.57), show that
(a) SSA/SSR is maximal invariant under the group of transformations described in §6.3.2;
(b) the UMPI test for (6.58) derived in §6.3.2 is also a UMPU test.


81. Show that (6.60) is equivalent to (6.61) when $c_0 > 1$.


82. In Proposition 6.5,
(a) show that $\log \ell(\hat\theta) - \log \ell(\theta_0)$ is strictly increasing (or decreasing) in $Y$ when $\hat\theta > \theta_0$ (or $\hat\theta < \theta_0$);
(b) prove part (iii).


83. In Exercises 40 and 41 of §2.6, consider $H_0: j = 1$ versus $H_1: j = 2$.
(a) Derive the likelihood ratio $\lambda(X)$.
(b) Obtain an LR test of size $\alpha$ in Exercise 40 of §2.6.


84. In Exercise 17, derive the likelihood ratio $\lambda(X)$ when (a) $H_0: \theta \le \theta_0$; (b) $H_0: \theta_1 \le \theta \le \theta_2$; and (c) $H_0: \theta \le \theta_1$ or $\theta \ge \theta_2$.


85. Let $X_1,\ldots,X_n$ be i.i.d. from the discrete uniform distribution on $\{1,\ldots,\theta\}$, where $\theta$ is an integer $\ge 2$. Find a level $\alpha$ LR test for
(a) $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, where $\theta_0$ is a known integer $\ge 2$;
(b) $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.


86. Let $X$ be a sample of size 1 from the p.d.f. $2\theta^{-2}(\theta - x)I_{(0,\theta)}(x)$, where $\theta > 0$ is unknown. Find an LR test of size $\alpha$ for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.


87. Let $X_1,\ldots,X_n$ be i.i.d. from the exponential distribution $E(a,\theta)$.
(a) Suppose that $\theta$ is known. Find an LR test of size $\alpha$ for testing $H_0: a \le a_0$ versus $H_1: a > a_0$.
(b) Suppose that $\theta$ is known. Find an LR test of size $\alpha$ for testing $H_0: a = a_0$ versus $H_1: a \neq a_0$.
(c) Repeat part (a) for the case where $\theta$ is also unknown.
(d) When both $\theta$ and $a$ are unknown, find an LR test of size $\alpha$ for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.
(e) When $a > 0$ and $\theta > 0$ are unknown, find an LR test of size $\alpha$ for testing $H_0: a = \theta$ versus $H_1: a \neq \theta$.


88. Let $X_1,\ldots,X_n$ be i.i.d. from the Pareto distribution $Pa(\gamma,\theta)$, where $\theta > 0$ and $\gamma > 0$ are unknown. Show that an LR test for $H_0: \theta = 1$ versus $H_1: \theta \neq 1$ rejects $H_0$ when $Y < c_1$ or $Y > c_2$, where $Y = \log\big(\prod_{i=1}^n X_i / X_{(1)}^n\big)$ and $c_1$ and $c_2$ are positive constants. Find values of $c_1$ and $c_2$ so that this LR test has size $\alpha$.


89. Let $X_{i1},\ldots,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from the uniform distributions $U(0,\theta_i)$, $i = 1,2$, respectively, where $\theta_1 > 0$




and $\theta_2 > 0$ are unknown.
(a) Find an LR test of size $\alpha$ for testing $H_0: \theta_1 = \theta_2$ versus $H_1: \theta_1 \neq \theta_2$.
(b) Derive the limit distribution of $-2\log\lambda$, where $\lambda$ is the likelihood ratio in part (a).


90. Let $X_{i1},\ldots,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from $N(\mu_i,\sigma_i^2)$, $i = 1,2$, respectively, where $\mu_i$'s and $\sigma_i^2$'s are unknown. For testing $H_0: \sigma_2^2/\sigma_1^2 = \Delta_0$ versus $H_1: \sigma_2^2/\sigma_1^2 \neq \Delta_0$, derive an LR test of size $\alpha$ and compare it with the UMPU test derived in §6.2.3.


91. Let $(X_{11},X_{12}),\ldots,(X_{n1},X_{n2})$ be i.i.d. from a bivariate normal distribution with unknown mean and covariance matrix. For testing $H_0: \rho = 0$ versus $H_1: \rho \neq 0$, where $\rho$ is the correlation coefficient, show that the test rejecting $H_0$ when $|W| > c$ is an LR test, where
$$W = \sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) \Big/ \Big[\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2 \sum_{i=1}^n (X_{i2} - \bar{X}_2)^2\Big]^{1/2}.$$
Find the distribution of $W$.


92. Let $X_1$ and $X_2$ be independently distributed as the Poisson distributions $P(\lambda_1)$ and $P(\lambda_2)$, respectively. Find an LR test of significance level $\alpha$ for testing
(a) $H_0: \lambda_1 = \lambda_2$ versus $H_1: \lambda_1 \neq \lambda_2$;
(b) $H_0: \lambda_1 \ge \lambda_2$ versus $H_1: \lambda_1 < \lambda_2$. (Is this test a UMPU test?)


93. Let $X_1$ and $X_2$ be independently distributed as the binomial distributions $Bi(p_1,n_1)$ and $Bi(p_2,n_2)$, respectively, where $n_i$'s are known and $p_i$'s are unknown. Find an LR test of significance level $\alpha$ for testing
(a) $H_0: p_1 = p_2$ versus $H_1: p_1 \neq p_2$;
(b) $H_0: p_1 \ge p_2$ versus $H_1: p_1 < p_2$. (Is this test a UMPU test?)


94. Let $X_1$ and $X_2$ be independently distributed as the negative binomial distributions $NB(p_1,n_1)$ and $NB(p_2,n_2)$, respectively, where $n_i$'s are known and $p_i$'s are unknown. Find an LR test of significance level $\alpha$ for testing
(a) $H_0: p_1 = p_2$ versus $H_1: p_1 \neq p_2$;
(b) $H_0: p_1 \le p_2$ versus $H_1: p_1 > p_2$.


95. Let $X_1$ and $X_2$ be independently distributed as the exponential distributions $E(0,\theta_i)$, $i = 1,2$, respectively. Define $\theta = \theta_1/\theta_2$. Find an LR test of size $\alpha$ for testing
(a) $H_0: \theta = 1$ versus $H_1: \theta \neq 1$;
(b) $H_0: \theta \le 1$ versus $H_1: \theta > 1$.




96. Let $X_{i1},\ldots,X_{in_i}$, $i = 1,2$, be independently distributed as the beta distributions with p.d.f.'s $\theta_i x^{\theta_i - 1}I_{(0,1)}(x)$, $i = 1,2$, respectively. For testing $H_0: \theta_1 = \theta_2$ versus $H_1: \theta_1 \neq \theta_2$, find the forms of the LR test, Wald's test, and Rao's score test.


97. In the proof of Theorem 6.6(ii), show that (6.65) and (6.66) hold and that (6.67) implies (6.68).


98. Let $X_1,\ldots,X_n$ be i.i.d. from $N(\mu,\sigma^2)$.
(a) Suppose that $\sigma^2 = \gamma\mu^2$ with unknown $\gamma > 0$ and $\mu \in \mathcal{R}$. Find an LR test for testing $H_0: \gamma = 1$ versus $H_1: \gamma \neq 1$.
(b) In the testing problem in (a), find the forms of $W_n$ for Wald's test and $R_n$ for Rao's score test, and discuss whether Theorems 6.5 and 6.6 can be applied.
(c) Repeat (a) and (b) when $\sigma^2 = \gamma\mu$ with unknown $\gamma > 0$ and $\mu > 0$.


99. Suppose that $X_1,\ldots,X_n$ are i.i.d. from the Weibull distribution with p.d.f. $\theta^{-1}\gamma x^{\gamma-1}e^{-x^\gamma/\theta}I_{(0,\infty)}(x)$, where $\gamma > 0$ and $\theta > 0$ are unknown. Consider the problem of testing $H_0: \gamma = 1$ versus $H_1: \gamma \neq 1$.
(a) Find an LR test and discuss whether Theorem 6.5 can be applied.
(b) Find the forms of $W_n$ for Wald's test and $R_n$ for Rao's score test.


100. Suppose that $X = (X_1,\ldots,X_k)$ has the multinomial distribution with the parameter $p = (p_1,\ldots,p_k)$. Consider the problem of testing (6.70). Find the forms of $W_n$ for Wald's test and $R_n$ for Rao's score test.


101. In Example 6.12, consider testing $H_0: P(A) = P(B)$ versus $H_1: P(A) \neq P(B)$.
(a) Derive the likelihood ratio $\lambda_n$ and the limiting distribution of $-2\log\lambda_n$ under $H_0$.
(b) Find the forms of $W_n$ for Wald's test and $R_n$ for Rao's score test.


102. Prove the claims in Example 6.24.


103. Consider testing independence in the $r \times c$ contingency table problem in Example 6.24. Find the forms of $W_n$ for Wald's test and $R_n$ for Rao's score test.


104. Under the conditions of Theorems 6.5 and 6.6, show that Wald's tests are Chernoff-consistent (Definition 2.13) if $\alpha$ is chosen to be $\alpha_n \to 0$ and $\chi^2_{r,\alpha_n} = o(n)$ as $n \to \infty$, where $\chi^2_{r,\alpha}$ is the $(1-\alpha)$th quantile of $\chi^2_r$.


105. Let $X_1,\ldots,X_n$ be i.i.d. binary random variables with $\theta = P(X_1 = 1)$.
(a) Let the prior $\Pi(\theta)$ be the c.d.f. of the beta distribution $B(a,b)$. Find the Bayes factor and the Bayes test for $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$.




(b) Let the prior c.d.f. be $\pi_0 I_{[\theta_0,\infty)}(\theta) + (1-\pi_0)\Pi(\theta)$, where $\Pi$ is the same as that in (a). Find the Bayes factor and the Bayes test for $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.


106. Let $X_1,\ldots,X_n$ be i.i.d. from the Poisson distribution $P(\theta)$.
(a) Let the prior c.d.f. be $\Pi(\theta) = (1 - e^{-\theta})I_{(0,\infty)}(\theta)$. Find the Bayes factor and the Bayes test for $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$.
(b) Let the prior c.d.f. be $\pi_0 I_{[\theta_0,\infty)}(\theta) + (1-\pi_0)\Pi(\theta)$, where $\Pi$ is the same as that in (a). Find the Bayes factor and the Bayes test for $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.


107. Let $X_i$, $i = 1,2$, be independent observations from the gamma distributions $\Gamma(a,\gamma_1)$ and $\Gamma(a,\gamma_2)$, respectively, where $a > 0$ is known and $\gamma_i > 0$, $i = 1,2$, are unknown. Find the Bayes factor and the Bayes test for $H_0: \gamma_1 = \gamma_2$ versus $H_1: \gamma_1 \neq \gamma_2$ under the prior c.d.f. $\Pi = \pi_0\Pi_0 + (1-\pi_0)\Pi_1$, where $\Pi_0(x_1,x_2) = G(\min\{x_1,x_2\})$, $\Pi_1(x_1,x_2) = G(x_1)G(x_2)$, $G(x)$ is the c.d.f. of a known gamma distribution, and $\pi_0$ is a known constant.


108. Find a condition under which the UMPI test given in Example 6.17 is better than the sign test given by (6.78) in terms of their power functions under $H_1$.


109. For testing (6.80), show that a test $T$ satisfying (6.81) is of size $\alpha$ and that the test in (6.82) satisfies (6.81).


110. Let $\mathcal{G}$ be the class of transformations $g(x) = (\psi(x_1),\ldots,\psi(x_n))$, where $\psi$ is continuous, odd, and strictly increasing. Let $R$ be the vector of ranks of $|x_1|,\ldots,|x_n|$ and $R_+$ (or $R_-$) be the subvector of $R$ containing ranks corresponding to positive (or negative) $x_i$'s. Show that $(R_+,R_-)$ is maximal invariant under $\mathcal{G}$. (Hint: see Example 6.14.)


111. Under $H_0$, obtain the distribution of $W$ in (6.83) for the one-sample Wilcoxon signed rank test when $n = 3$ or 4.


112. For the one-sample Wilcoxon signed rank test, show that $t_0$ and $\sigma_0^2$ in (6.85) are equal to $\frac{1}{4}$ and $\frac{1}{12}$, respectively.


113. Using the results in §5.2.2, derive a two-sample rank test for testing (6.80) that has limiting size $\alpha$.


114. Prove Theorem 6.10(i) and show that $D_n^+(F)$ and $D_n^-(F)$ have the same distribution.


115. Show that the one-sided and two-sided Kolmogorov-Smirnov tests are consistent according to Definition 2.13.




116. Let $C_n(F)$ be given by (6.88) for any continuous c.d.f. $F$ on $\mathcal{R}$. Show that the distribution of $C_n(F)$ does not vary with $F$.


117. Show that the Cramér-von Mises tests are consistent.


118. In Example 6.27, show that the one-sample Wilcoxon signed rank test is consistent.


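119. Let $X_1,\ldots,X_n$ be i.i.d. from a c.d.f. $F$ on $\mathcal{R}^d$ and $\theta = E(X_1)$.
(a) Derive the empirical likelihood ratio for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.
(b) Let $\theta = (\vartheta,\varphi)$. Derive the profile empirical likelihood ratio for testing $H_0: \vartheta = \vartheta_0$ versus $H_1: \vartheta \neq \vartheta_0$.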


120. Prove Theorem 6.12(ii).


121. Let $X_{i1},\ldots,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from $F_i$ on $\mathcal{R}$, $i = 1,2$, respectively, and let $\mu_i = E(X_{i1})$.
(a) Show that the two-sample t-test derived in §6.2.3 for testing $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ has asymptotic significance level $\alpha$ and is consistent, if $n_1 \to \infty$, $n_1/n_2 \to c \in (0,1)$, and $\sigma_1^2 = \sigma_2^2$.
(b) Derive a consistent asymptotic test for testing $H_0: \mu_1/\mu_2 = \Delta_0$ versus $H_1: \mu_1/\mu_2 \neq \Delta_0$, assuming that $\mu_2 \neq 0$.


122. Consider the general linear model (3.25) with i.i.d. $\varepsilon_i$'s having $E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2$.
(a) Under the conditions of Theorem 3.12, derive a consistent asymptotic test based on the LSE $l^\tau\hat\beta$ for testing $H_0: l^\tau\beta = \theta_0$ versus $H_1: l^\tau\beta \neq \theta_0$, where $l \in \mathcal{R}(Z)$.
(b) Show that the LR test in Example 6.21 has asymptotic significance level $\alpha$ and is consistent.


123. Let $\hat\theta_n$ be an estimator of a real-valued parameter $\theta$ such that (6.92) holds for any $\theta$ and let $\hat{V}_n$ be a consistent estimator of $V_n$. Suppose that $V_n \to 0$.
(a) Show that the test with rejection region $\hat{V}_n^{-1/2}(\hat\theta_n - \theta_0) > z_{1-\alpha}$ is a consistent asymptotic test for testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$.
(b) Apply the result in (a) to show that the one-sample one-sided t-test in §6.2.3 is a consistent asymptotic test.


124. Let $X_1,\ldots,X_n$ be i.i.d. from the gamma distribution $\Gamma(\theta,\gamma)$, where $\theta > 0$ and $\gamma > 0$ are unknown. Let $T_n = n\sum_{i=1}^n X_i^2/(\sum_{i=1}^n X_i)^2$. Show how to use $T_n$ to obtain an asymptotically correct test for $H_0: \theta = 1$ versus $H_1: \theta \neq 1$.


Chapter 7 


Confidence Sets 


Various methods of constructing confidence sets are introduced in this chapter, along with studies of properties of confidence sets. Throughout this chapter $X = (X_1,\ldots,X_n)$ denotes a sample from a population $P \in \mathcal{P}$; $\theta = \theta(P)$ denotes a functional from $\mathcal{P}$ to $\Theta \subset \mathcal{R}^k$ for a fixed integer $k$; and $C(X)$ denotes a confidence set for $\theta$, a set in $\mathcal{B}_\Theta$ (the class of Borel sets on $\Theta$) depending only on $X$. We adopt the basic concepts of confidence sets introduced in §2.4.3. In particular, $\inf_{P \in \mathcal{P}} P(\theta \in C(X))$ is the confidence coefficient of $C(X)$ and, if the confidence coefficient of $C(X)$ is $\ge 1-\alpha$ for fixed $\alpha \in (0,1)$, then we say that $C(X)$ has significance level $1-\alpha$ or $C(X)$ is a level $1-\alpha$ confidence set.


7.1 Construction of Confidence Sets 


In this section, we introduce some basic methods for constructing confidence 
sets that have a given significance level (or confidence coefficient) for any 
fixed n. Properties and comparisons of confidence sets are given in §7.2. 


7.1.1 Pivotal quantities 


Perhaps the most popular method of constructing confidence sets is the use 
of pivotal quantities defined as follows. 


Definition 7.1. A known Borel function $R$ of $(X,\theta)$ is called a pivotal quantity if and only if the distribution of $R(X,\theta)$ does not depend on $P$.


Note that a pivotal quantity depends on $P$ through $\theta = \theta(P)$. A pivotal quantity is usually not a statistic, although its distribution is known.




With a pivotal quantity $R(X,\theta)$, a level $1-\alpha$ confidence set for any given $\alpha \in (0,1)$ can be obtained as follows. First, find two constants $c_1$ and $c_2$ such that
$$P(c_1 \le R(X,\theta) \le c_2) \ge 1-\alpha. \tag{7.1}$$
Next, define
$$C(X) = \{\theta \in \Theta : c_1 \le R(X,\theta) \le c_2\}. \tag{7.2}$$
Then $C(X)$ is a level $1-\alpha$ confidence set, since
$$\inf_{P \in \mathcal{P}} P(\theta \in C(X)) = \inf_{P \in \mathcal{P}} P(c_1 \le R(X,\theta) \le c_2) = P(c_1 \le R(X,\theta) \le c_2) \ge 1-\alpha.$$
Note that the confidence coefficient of $C(X)$ may not be $1-\alpha$. If $R(X,\theta)$ has a continuous c.d.f., then we can choose $c_i$'s such that the equality in (7.1) holds and, therefore, the confidence set $C(X)$ has confidence coefficient $1-\alpha$.

In a given problem, there may not exist any pivotal quantity, or there may be many different pivotal quantities. When there are many pivotal quantities, one has to choose one based on some principles or criteria, which are discussed in §7.2. For example, pivotal quantities based on sufficient statistics are certainly preferred. In many cases we also have to choose $c_i$'s in (7.1) based on some criteria.


When $R(X,\theta)$ and $c_i$'s are chosen, we need to compute the confidence set $C(X)$ in (7.2). This can be done by inverting $c_1 \le R(X,\theta) \le c_2$. For example, if $\theta$ is real-valued and $R(X,\theta)$ is monotone in $\theta$ when $X$ is fixed, then $C(X) = \{\theta : \underline{\theta}(X) \le \theta \le \overline{\theta}(X)\}$ for some $\underline{\theta}(X) < \overline{\theta}(X)$, i.e., $C(X)$ is an interval (finite or infinite); if $R(X,\theta)$ is not monotone, then $C(X)$ may be a union of several intervals. For real-valued $\theta$, a confidence interval rather than a complex set such as a union of several intervals is generally preferred since it is simple and the result is easy to interpret. When $\theta$ is multivariate, inverting $c_1 \le R(X,\theta) \le c_2$ may be complicated. In most cases where explicit forms of $C(X)$ do not exist, $C(X)$ can still be obtained numerically.
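
The inversion step can be carried out numerically when no closed form is available. The following Python sketch assumes a real-valued $\theta$ and a pivot that is continuous and strictly decreasing in $\theta$ for fixed data; the function name `invert_decreasing_pivot`, the bracketing bounds `lo` and `hi`, and the bisection tolerance are illustrative choices that do not come from the text.

```python
from statistics import fmean

def invert_decreasing_pivot(pivot, data, c1, c2, lo, hi, tol=1e-10):
    """Return (lower, upper) such that c1 <= pivot(data, theta) <= c2 on [lower, upper].

    Assumes pivot(data, .) is continuous and strictly decreasing on [lo, hi]
    and that both endpoints lie inside [lo, hi] (hypothetical inputs).
    """
    def solve(target):
        a, b = lo, hi
        while b - a > tol:
            mid = (a + b) / 2
            # The pivot is decreasing, so a value above the target
            # means the root lies to the right of mid.
            if pivot(data, mid) > target:
                a = mid
            else:
                b = mid
        return (a + b) / 2

    # R = c2 gives the lower endpoint and R = c1 the upper endpoint.
    return solve(c2), solve(c1)

# Location pivot R(X, mu) = Xbar - mu, which is decreasing in mu.
x = [4.1, 5.3, 4.8, 5.9, 5.0]
lower, upper = invert_decreasing_pivot(
    lambda d, mu: fmean(d) - mu, x, c1=-0.9, c2=0.9, lo=-100.0, hi=100.0
)
```

For this location pivot the result can be checked against the closed form $[\bar{X} - c_2, \bar{X} - c_1]$; the numerical route only pays off when the pivot cannot be inverted by hand.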


Example 7.1 (Location-scale families). Suppose that $X_1,\ldots,X_n$ are i.i.d. with a Lebesgue p.d.f. $\frac{1}{\sigma}f\big(\frac{x-\mu}{\sigma}\big)$, where $\mu \in \mathcal{R}$, $\sigma > 0$, and $f$ is a known Lebesgue p.d.f.


Consider first the case where $\sigma$ is known and $\theta = \mu$. For any fixed $i$, $X_i - \mu$ is a pivotal quantity. Also, $\bar{X} - \mu$ is a pivotal quantity, since any function of independent pivotal quantities is pivotal. In many cases $\bar{X} - \mu$ is preferred. Let $c_1$ and $c_2$ be constants such that $P(c_1 \le \bar{X} - \mu \le c_2) = 1-\alpha$.




Then $C(X)$ in (7.2) is
$$C(X) = \{\mu : c_1 \le \bar{X} - \mu \le c_2\} = \{\mu : \bar{X} - c_2 \le \mu \le \bar{X} - c_1\},$$
i.e., $C(X)$ is the interval $[\bar{X} - c_2, \bar{X} - c_1] \subset \mathcal{R} = \Theta$. This interval has confidence coefficient $1-\alpha$. The choice of $c_i$'s is not unique. Some criteria discussed in §7.2 can be applied to choose $c_i$'s. One particular choice (not necessarily the best choice) frequently used by practitioners is $c_1 = -c_2$. The resulting $C(X)$ is symmetric about $\bar{X}$ and is also an equal-tail confidence interval (a confidence interval $[\underline{\theta},\overline{\theta}]$ is equal-tail if $P(\theta < \underline{\theta}) = P(\theta > \overline{\theta})$) if the distribution of $\bar{X}$ is symmetric about $\mu$. Note that the confidence interval in Example 2.31 is a special case of the intervals considered here.
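
As a small illustration of the choice $c_1 = -c_2$, the sketch below assumes that $f$ is the standard normal p.d.f., so that $\bar{X} - \mu$ is $N(0,\sigma^2/n)$; this normality assumption and the function name `normal_mean_interval` are made only for the example.

```python
import math
from statistics import NormalDist, fmean

def normal_mean_interval(xs, sigma, alpha):
    """Interval [Xbar - c2, Xbar - c1] with c1 = -c2, assuming f is standard normal."""
    n = len(xs)
    # Xbar - mu ~ N(0, sigma^2 / n), so c2 is its (1 - alpha/2)th quantile.
    c2 = NormalDist(0.0, sigma / math.sqrt(n)).inv_cdf(1 - alpha / 2)
    xbar = fmean(xs)
    return xbar - c2, xbar + c2

lower, upper = normal_mean_interval([1.0, 2.0, 3.0], sigma=1.0, alpha=0.05)
```

For a non-normal $f$ the same construction applies, but the quantiles of $\bar{X} - \mu$ would have to come from the distribution of the pivot under that $f$.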


Consider next the case where $\mu$ is known and $\theta = \sigma$. The following quantities are pivotal: $(X_i - \mu)/\sigma$, $i = 1,\ldots,n$, $\prod_{i=1}^n (X_i - \mu)/\sigma$, $(\bar{X} - \mu)/\sigma$, and $S/\sigma$, where $S^2$ is the sample variance. Consider the confidence set (7.2) with $R = S/\sigma$. Let $c_1$ and $c_2$ be chosen such that $P(c_1 \le S/\sigma \le c_2) = 1-\alpha$. If both $c_i$'s are positive, then
$$C(X) = \{\sigma : S/c_2 \le \sigma \le S/c_1\} = [S/c_2, S/c_1]$$
is a finite interval. Similarly, if $c_1 = 0$ ($0 < c_2 < \infty$) or $c_2 = \infty$ ($0 < c_1 < \infty$), then $C(X) = [S/c_2, \infty)$ or $(0, S/c_1]$.

When $\theta = \sigma$ and $\mu$ is also unknown, $S/\sigma$ is still a pivotal quantity and, hence, confidence intervals of $\sigma$ based on $S$ are still valid. Note that $(\bar{X} - \mu)/\sigma$ and $\prod_{i=1}^n (X_i - \mu)/\sigma$ are not pivotal when $\mu$ is unknown.


Finally, we consider the case where both $\mu$ and $\sigma$ are unknown and $\theta = \mu$. There are still many different pivotal quantities, but the most commonly used pivotal quantity is $t(X) = \sqrt{n}(\bar{X} - \mu)/S$. The distribution of $t(X)$ does not depend on $(\mu,\sigma)$. When $f$ is normal, $t(X)$ has the t-distribution $t_{n-1}$. The pivotal quantity $t(X)$ is often called a studentized statistic or t-statistic, although $t(X)$ is not a statistic and $t(X)$ does not have a t-distribution when $f$ is not normal. A confidence interval for $\mu$ based on $t(X)$ is of the form
$$\{\mu : c_1 \le \sqrt{n}(\bar{X} - \mu)/S \le c_2\} = [\bar{X} - c_2 S/\sqrt{n}, \bar{X} - c_1 S/\sqrt{n}],$$
where $c_i$'s are chosen so that $P(c_1 \le t(X) \le c_2) = 1-\alpha$.


Example 7.2. Let $X_1,\ldots,X_n$ be i.i.d. random variables from the uniform distribution $U(0,\theta)$. Consider the problem of finding a confidence set for $\theta$. Note that the family $\mathcal{P}$ in this case is a scale family so that the results in Example 7.1 can be used. But a better confidence interval can be obtained based on the sufficient and complete statistic $X_{(n)}$ for which $X_{(n)}/\theta$ is a pivotal quantity (Example 7.13). Note that $X_{(n)}/\theta$ has the Lebesgue p.d.f.




$nx^{n-1}I_{(0,1)}(x)$. Hence $c_i$'s in (7.1) should satisfy $c_2^n - c_1^n = 1-\alpha$. The resulting confidence interval for $\theta$ is $[c_2^{-1}X_{(n)}, c_1^{-1}X_{(n)}]$. Choices of $c_i$'s are discussed in Example 7.13.
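
As a concrete instance of this interval, the sketch below takes $c_2 = 1$, so that $c_1 = \alpha^{1/n}$ satisfies $c_2^n - c_1^n = 1-\alpha$. This particular pair and the function name `uniform_scale_interval` are assumptions made for illustration; the text defers the choice of $c_i$'s to Example 7.13.

```python
def uniform_scale_interval(xs, alpha):
    """Confidence interval for theta in U(0, theta) from the pivot X_(n)/theta.

    Uses c2 = 1 and c1 = alpha**(1/n), one of many pairs with
    c2**n - c1**n = 1 - alpha (an illustrative choice).
    """
    n = len(xs)
    x_max = max(xs)
    c1, c2 = alpha ** (1.0 / n), 1.0
    # Inverting c1 <= X_(n)/theta <= c2 gives [X_(n)/c2, X_(n)/c1].
    return x_max / c2, x_max / c1

lower, upper = uniform_scale_interval([0.8, 2.4, 1.1, 3.0], alpha=0.05)
```

With this choice the lower endpoint is the sample maximum itself, which reflects the fact that $\theta$ can never be smaller than any observation.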


Example 7.3 (Fieller's interval). Let $(X_{i1},X_{i2})$, $i = 1,\ldots,n$, be i.i.d. bivariate normal with unknown $\mu_j = E(X_{1j})$, $\sigma_j^2 = \mathrm{Var}(X_{1j})$, $j = 1,2$, and $\sigma_{12} = \mathrm{Cov}(X_{11},X_{12})$. Let $\theta = \mu_2/\mu_1$ be the parameter of interest ($\mu_1 \neq 0$). Define $Y_i(\theta) = X_{i2} - \theta X_{i1}$. Then $Y_1(\theta),\ldots,Y_n(\theta)$ are i.i.d. from $N(0, \sigma_2^2 - 2\theta\sigma_{12} + \theta^2\sigma_1^2)$. Let
$$S^2(\theta) = \frac{1}{n-1}\sum_{i=1}^n [Y_i(\theta) - \bar{Y}(\theta)]^2 = S_2^2 - 2\theta S_{12} + \theta^2 S_1^2,$$
where $\bar{Y}(\theta)$ is the average of $Y_i(\theta)$'s and $S_i^2$ and $S_{12}$ are sample variances and covariance based on $X_{ij}$'s. It follows from Examples 1.16 and 2.18 that $\sqrt{n}\bar{Y}(\theta)/S(\theta)$ has the t-distribution $t_{n-1}$ and, therefore, is a pivotal quantity. Let $t_{n-1,\alpha}$ be the $(1-\alpha)$th quantile of the t-distribution $t_{n-1}$. Then
$$C(X) = \{\theta : n[\bar{Y}(\theta)]^2/S^2(\theta) \le t^2_{n-1,\alpha/2}\}$$
is a confidence set for $\theta$ with confidence coefficient $1-\alpha$. Note that $n[\bar{Y}(\theta)]^2 = t^2_{n-1,\alpha/2}S^2(\theta)$ defines a parabola in $\theta$. Depending on the roots of the parabola, $C(X)$ can be a finite interval, the complement of a finite interval, or the whole real line (exercise).
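
The three possible shapes of Fieller's set can be read off the quadratic $a\theta^2 + b\theta + c \le 0$ obtained by expanding $n[\bar{Y}(\theta)]^2 - t^2_{n-1,\alpha/2}S^2(\theta) \le 0$. The Python sketch below works from summary statistics; the function name `fieller_set` and the convention that the caller supplies the quantile $t_{n-1,\alpha/2}$ (for instance from a table) are assumptions of this example, and the boundary case $a = 0$ is deliberately left out.

```python
import math

def fieller_set(n, xbar1, xbar2, s1_sq, s2_sq, s12, t_quantile):
    """Describe {theta : n*Ybar(theta)^2 <= t^2 * S^2(theta)} for theta = mu2/mu1.

    t_quantile is t_{n-1, alpha/2}, supplied by the caller (hypothetical input).
    Returns (shape, r1, r2), where r1 <= r2 are the roots when they exist.
    """
    t2 = t_quantile ** 2
    # Coefficients of n*(xbar2 - theta*xbar1)^2
    #   - t2*(s2_sq - 2*theta*s12 + theta^2*s1_sq) as a quadratic in theta.
    a = n * xbar1 ** 2 - t2 * s1_sq
    b = -2.0 * (n * xbar1 * xbar2 - t2 * s12)
    c = n * xbar2 ** 2 - t2 * s2_sq
    if a == 0:
        raise ValueError("degenerate case a = 0 is not handled in this sketch")
    disc = b * b - 4.0 * a * c
    if disc < 0:
        # No real roots. With a > 0 this cannot happen for valid sample moments,
        # because theta = xbar2/xbar1 always satisfies the inequality.
        return ("whole line", None, None)
    root = math.sqrt(disc)
    r1, r2 = sorted(((-b - root) / (2 * a), (-b + root) / (2 * a)))
    if a > 0:
        return ("interval", r1, r2)            # [r1, r2]
    return ("complement of interval", r1, r2)  # (-inf, r1] union [r2, inf)

# n = 10 pairs, using the tabulated value t_{9, 0.025} = 2.262.
shape, r1, r2 = fieller_set(10, 5.0, 10.0, 1.0, 4.0, 1.0, 2.262)
```

A finite interval arises when $a > 0$, that is, when $\sqrt{n}|\bar{X}_1|/S_1$ exceeds the t quantile, so the denominator mean is clearly separated from zero.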


Example 7.4. Consider the normal linear model $X = N_n(Z\beta,\sigma^2 I_n)$, where $\theta = \beta$ is a $p$-vector of unknown parameters and $Z$ is a known $n \times p$ matrix of full rank. A pivotal quantity is
$$R(X,\beta) = \frac{(\hat\beta - \beta)^\tau Z^\tau Z(\hat\beta - \beta)/p}{\|X - Z\hat\beta\|^2/(n-p)},$$
where $\hat\beta$ is the LSE of $\beta$. By Theorem 3.8 and Example 1.16, $R(X,\beta)$ has the F-distribution $F_{p,n-p}$. We can then obtain a confidence set
$$C(X) = \{\beta : c_1 \le R(X,\beta) \le c_2\}.$$
Note that $\{\beta : R(X,\beta) < c\}$ is the interior of an ellipsoid in $\mathcal{R}^p$.


The following result indicates that in many problems, there exist pivotal 
quantities. 


Proposition 7.1. Let $T(X) = (T_1(X),\ldots,T_s(X))$ and $T_1,\ldots,T_s$ be independent statistics. Suppose that each $T_i$ has a continuous c.d.f. $F_{T_i,\theta}$ indexed by $\theta$. Then $R(X,\theta) = \prod_{i=1}^s F_{T_i,\theta}(T_i(X))$ is a pivotal quantity.




Proof. The result follows from the fact that $F_{T_i,\theta}(T_i)$'s are i.i.d. from the uniform distribution $U(0,1)$.


When $X_1,\ldots,X_n$ are i.i.d. from a parametric family indexed by $\theta$, the simplest way to apply Proposition 7.1 is to take $T(X) = X$. However, the resulting pivotal quantity may not be the best pivotal quantity. For instance, the pivotal quantity in Example 7.2 is a function of the one obtained by applying Proposition 7.1 with $T(X) = X_{(n)}$ ($s = 1$), which is better than the one obtained by using $T(X) = X$ (Example 7.13).

The result in Proposition 7.1 holds even when $P$ is in a nonparametric family, but in a nonparametric problem, it may be difficult to find a statistic $T$ whose c.d.f. is indexed by $\theta$, the parameter vector of interest.

When $\theta$ and $T$ in Proposition 7.1 are real-valued, we can use the following result to construct confidence intervals for $\theta$ even when the c.d.f. of $T$ is not continuous.


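Theorem 7.1. Suppose that $P$ is in a parametric family indexed by a real-valued $\theta$. Let $T(X)$ be a real-valued statistic with c.d.f. $F_{T,\theta}(t)$ and let $\alpha_1$ and $\alpha_2$ be fixed positive constants such that $\alpha_1 + \alpha_2 = \alpha < \frac{1}{2}$.
(i) Suppose that $F_{T,\theta}(t)$ and $F_{T,\theta}(t-)$ are nonincreasing in $\theta$ for each fixed $t$. Define
$$\overline{\theta} = \sup\{\theta : F_{T,\theta}(T) \ge \alpha_1\} \quad \text{and} \quad \underline{\theta} = \inf\{\theta : F_{T,\theta}(T-) \le 1-\alpha_2\}.$$
Then $[\underline{\theta}(T),\overline{\theta}(T)]$ is a level $1-\alpha$ confidence interval for $\theta$.
(ii) If $F_{T,\theta}(t)$ and $F_{T,\theta}(t-)$ are nondecreasing in $\theta$ for each $t$, then the same result holds with
$$\underline{\theta} = \inf\{\theta : F_{T,\theta}(T) \ge \alpha_1\} \quad \text{and} \quad \overline{\theta} = \sup\{\theta : F_{T,\theta}(T-) \le 1-\alpha_2\}.$$
(iii) If $F_{T,\theta}$ is a continuous c.d.f. for any $\theta$, then $F_{T,\theta}(T)$ is a pivotal quantity and the confidence interval in (i) or (ii) has confidence coefficient $1-\alpha$.
Proof. We only need to prove (i). Under the given condition, $\theta > \overline{\theta}$ implies $F_{T,\theta}(T) < \alpha_1$ and $\theta < \underline{\theta}$ implies $F_{T,\theta}(T-) > 1-\alpha_2$. Hence,
$$P(\underline{\theta} \le \theta \le \overline{\theta}) \ge 1 - P(F_{T,\theta}(T) < \alpha_1) - P(F_{T,\theta}(T-) > 1-\alpha_2).$$
The result follows from
$$P(F_{T,\theta}(T) < \alpha_1) \le \alpha_1 \quad \text{and} \quad P(F_{T,\theta}(T-) > 1-\alpha_2) \le \alpha_2. \tag{7.3}$$
The proof of (7.3) is left as an exercise.
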
When the parametric family in Theorem 7.1 has monotone likelihood ratio in $T(X)$, it follows from Lemma 6.3 that the condition in Theorem 7.1(i) holds; in fact, it follows from Exercise 2 in §6.6 that $F_{T,\theta}(t)$ is strictly


476 7. Confidence Sets 


decreasing for any t at which 0 < Fra(t) < 1. If Fro(t) is also continuous 
in 0, limg_.6_ Fro(t) > a1, and limo 6, Fro(t) < a1, where 0_ and Oy 
are the two ends of the parameter space, then @ is the unique solution of 
Fr9(t) = a,. A similar conclusion can be drawn for 0. 


Theorem 7.1 can be applied to obtain the confidence interval for $\theta$ in
Example 7.2 (exercise). The following example concerns a discrete $F_{T,\theta}$.


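Example 7.5. Let $X_1,...,X_n$ be i.i.d. random variables from the Poisson
distribution $P(\theta)$ with an unknown $\theta > 0$ and $T(X) = \sum_{i=1}^n X_i$. Note that
$T$ is sufficient and complete for $\theta$ and has the Poisson distribution $P(n\theta)$.
Thus,

$$F_{T,\theta}(t) = \sum_{j=0}^{t} \frac{e^{-n\theta}(n\theta)^j}{j!}, \qquad t = 0,1,2,....$$

Since the Poisson family has monotone likelihood ratio in $T$ and $0 < F_{T,\theta}(t) < 1$
for any $t$, $F_{T,\theta}(t)$ is strictly decreasing in $\theta$. Also, $F_{T,\theta}(t)$
is continuous in $\theta$ and $F_{T,\theta}(t)$ tends to 1 and 0 as $\theta$ tends to 0 and $\infty$,
respectively. Thus, Theorem 7.1 applies and $\bar\theta$ is the unique solution of
$F_{T,\theta}(T) = \alpha_1$. Since $F_{T,\theta}(t-) = F_{T,\theta}(t-1)$ for $t > 0$, $\underline\theta$ is the unique
solution of $F_{T,\theta}(t-1) = 1-\alpha_2$ when $T = t > 0$ and $\underline\theta = 0$ when $T = 0$. In
fact, in this case explicit forms of $\underline\theta$ and $\bar\theta$ can be obtained from

$$\frac{1}{\Gamma(t)} \int_\lambda^\infty x^{t-1}e^{-x}\,dx = \sum_{j=0}^{t-1} \frac{e^{-\lambda}\lambda^j}{j!}, \qquad t = 1,2,....$$

Using this equality, it can be shown (exercise) that

$$\bar\theta = (2n)^{-1}\chi^2_{2(T+1),\alpha_1} \quad\text{and}\quad \underline\theta = (2n)^{-1}\chi^2_{2T,1-\alpha_2}, \tag{7.4}$$

where $\chi^2_{r,\alpha}$ is the $(1-\alpha)$th quantile of the chi-square distribution $\chi^2_r$ and
$\chi^2_{0,a}$ is defined to be 0. ∎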
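The construction in Example 7.5 can also be carried out numerically, without the chi-square form (7.4). The following is a minimal sketch, assuming plain bisection is an acceptable root finder; the function names `poisson_cdf`, `solve_decreasing`, and `poisson_interval` are hypothetical and chosen only for this illustration. It solves $F_{T,\theta}(T) = \alpha_1$ for the upper limit and $F_{T,\theta}(T-1) = 1-\alpha_2$ for the lower limit, using the fact that $F_{T,\theta}(t)$ is continuous and strictly decreasing in $\theta$.

```python
import math

def poisson_cdf(t, mean):
    """P(T <= t) for T ~ Poisson(mean); 0 when t < 0."""
    if t < 0:
        return 0.0
    term = math.exp(-mean)          # j = 0 term of the sum
    total = term
    for j in range(1, t + 1):
        term *= mean / j            # e^{-mean} mean^j / j!
        total += term
    return min(total, 1.0)

def solve_decreasing(g, target, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection for g(theta) = target, g continuous and strictly decreasing,
    assuming g(lo) > target (illustrative helper, not from the text)."""
    while g(hi) > target:           # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson_interval(t, n, alpha1, alpha2):
    """Limits of Theorem 7.1(i) for the Poisson mean, given T = t and sample size n."""
    upper = solve_decreasing(lambda th: poisson_cdf(t, n * th), alpha1)
    if t == 0:
        lower = 0.0                 # the lower limit is 0 when T = 0
    else:
        lower = solve_decreasing(lambda th: poisson_cdf(t - 1, n * th), 1 - alpha2)
    return lower, upper
```

When $T = 0$ and $n = 1$ the upper limit solves $e^{-\theta} = \alpha_1$, so it equals $-\log\alpha_1$, which gives a simple check of the routine. For very large $n\theta$ the term `math.exp(-mean)` underflows, so this sketch is meant for moderate counts only.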


So far we have considered examples for parametric problems. In a non- 
parametric problem, a pivotal quantity may not exist and we have to con- 
sider approximate pivotal quantities (§7.3 and §7.4). The following is an 
example of a nonparametric problem in which there exist pivotal quantities. 


Example 7.6. Let $X_1,...,X_n$ be i.i.d. random variables from $F \in \mathcal{F}$
containing all continuous and symmetric distributions on $\mathcal{R}$. Suppose that
$F$ is symmetric about $\theta$. Let $R(\theta)$ be the vector of ranks of $|X_i - \theta|$'s
and $R_+(\theta)$ be the subvector of $R(\theta)$ containing ranks corresponding to
positive $(X_i - \theta)$'s. Then, any real-valued Borel function of $R_+(\theta)$ is a
pivotal quantity (see the discussion in §6.5.1). Various confidence sets can
be constructed using these pivotal quantities. More details can be found in
Example 7.10. ∎




7.1.2 Inverting acceptance regions of tests 


Another popular method of constructing confidence sets is to use a close
relationship between confidence sets and hypothesis tests. For any test $T$,
the set $\{x : T(x) \ne 1\}$ is called the acceptance region. Note that this
terminology is not precise when $T$ is a randomized test.


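Theorem 7.2. For each $\theta_0 \in \Theta$, let $T_{\theta_0}$ be a test for $H_0 : \theta = \theta_0$ (versus
some $H_1$) with significance level $\alpha$ and acceptance region $A(\theta_0)$. For each
$x$ in the range of $X$, define

$$C(x) = \{\theta : x \in A(\theta)\}.$$

Then $C(X)$ is a level $1-\alpha$ confidence set for $\theta$. If $T_{\theta_0}$ is nonrandomized
and has size $\alpha$ for every $\theta_0$, then $C(X)$ has confidence coefficient $1-\alpha$.

Proof. We prove the first assertion only. The proof for the second assertion
is similar. Under the given condition,

$$\sup_{\theta=\theta_0} P(X \notin A(\theta_0)) = \sup_{\theta=\theta_0} P(T_{\theta_0} = 1) \le \alpha,$$

which is the same as

$$1-\alpha \le \inf_{\theta=\theta_0} P(X \in A(\theta_0)) = \inf_{\theta=\theta_0} P(\theta_0 \in C(X)).$$

Since this holds for all $\theta_0$, the result follows from

$$\inf_{P\in\mathcal{P}} P(\theta \in C(X)) = \inf_{\theta_0\in\Theta}\, \inf_{\theta=\theta_0} P(\theta_0 \in C(X)) \ge 1-\alpha. \qquad ∎$$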


The converse of Theorem 7.2 is partially true, which is stated in the 
next result whose proof is left as an exercise. 


Proposition 7.2. Let $C(X)$ be a confidence set for $\theta$ with significance level
(or confidence coefficient) $1-\alpha$. For any $\theta_0 \in \Theta$, define a region $A(\theta_0) =
\{x : \theta_0 \in C(x)\}$. Then the test $T(X) = 1 - I_{A(\theta_0)}(X)$ has significance level
$\alpha$ for testing $H_0 : \theta = \theta_0$ versus some $H_1$. ∎


In general, $C(X)$ in Theorem 7.2 can be determined numerically, if it
does not have an explicit form. Theorem 7.2 can be best illustrated in
the case where $\theta$ is real-valued and $A(\theta) = \{Y : a(\theta) \le Y \le b(\theta)\}$ for
a real-valued statistic $Y(X)$ and some nondecreasing functions $a(\theta)$ and
$b(\theta)$. When we observe $Y = y$, $C(X)$ is an interval with limits $\underline\theta$ and $\bar\theta$,
which are the $\theta$-values at which the horizontal line $Y = y$ intersects the
curves $Y = b(\theta)$ and $Y = a(\theta)$ (Figure 7.1), respectively. If $y = b(\theta)$ (or
$y = a(\theta)$) has no solution or more than one solution, $\underline\theta = \inf\{\theta : y \le b(\theta)\}$
(or $\bar\theta = \sup\{\theta : a(\theta) \le y\}$). $C(X)$ does not include $\underline\theta$ (or $\bar\theta$) if and only if at
$\underline\theta$ (or $\bar\theta$), $b(\theta)$ (or $a(\theta)$) is only left-continuous (or right-continuous).

Figure 7.1: A confidence interval obtained by inverting $A(\theta) = [a(\theta), b(\theta)]$

Example 7.7. Suppose that $X$ has the following p.d.f. in a one-parameter
exponential family: $f_\theta(x) = \exp\{\eta(\theta)Y(x) - \xi(\theta)\}h(x)$, where $\theta$ is real-valued
and $\eta(\theta)$ is nondecreasing in $\theta$. First, we apply Theorem 7.2 with
$H_0 : \theta = \theta_0$ and $H_1 : \theta > \theta_0$. By Theorem 6.2, the acceptance region of the
UMP test of size $\alpha$ given by (6.11) is $A(\theta_0) = \{x : Y(x) \le c(\theta_0)\}$, where
$c(\theta_0) = c$ in (6.11). It can be shown (exercise) that $c(\theta)$ is nondecreasing in
$\theta$. Inverting $A(\theta)$ according to Figure 7.1 with $b(\theta) = c(\theta)$ and $a(\theta)$ ignored,
we obtain $C(X) = [\underline\theta(X), \infty)$ or $(\underline\theta(X), \infty)$, a one-sided confidence interval
for $\theta$ with significance level $1-\alpha$. ($\underline\theta(X)$ is called a lower confidence
bound for $\theta$ in §2.4.3.) When the c.d.f. of $Y(X)$ is continuous, $C(X)$ has
confidence coefficient $1-\alpha$.

In the previous derivation, if $H_0 : \theta = \theta_0$ and $H_1 : \theta < \theta_0$ are considered,
then $C(X) = \{\theta : Y(X) \ge c(\theta)\}$ and is of the form $(-\infty, \bar\theta(X)]$ or
$(-\infty, \bar\theta(X))$. ($\bar\theta(X)$ is called an upper confidence bound for $\theta$.)

Consider next $H_0 : \theta = \theta_0$ and $H_1 : \theta \ne \theta_0$. By Theorem 6.4, the
acceptance region of the UMPU test of size $\alpha$ defined in (6.28) is given by
$A(\theta_0) = \{x : c_1(\theta_0) \le Y(x) \le c_2(\theta_0)\}$, where $c_i(\theta)$ are nondecreasing
(exercise). A confidence interval can be obtained by inverting $A(\theta)$ according
to Figure 7.1 with $a(\theta) = c_1(\theta)$ and $b(\theta) = c_2(\theta)$.




Let us consider a specific example in which $X_1,...,X_n$ are i.i.d. binary
random variables with $p = P(X_i = 1)$. Note that $Y(X) = \sum_{i=1}^n X_i$.
Suppose that we need a lower confidence bound for $p$ so that we consider
$H_0 : p = p_0$ and $H_1 : p > p_0$. From Example 6.2, the acceptance region of
a UMP test of size $\alpha \in (0,1)$ is $A(p_0) = \{y : y \le m(p_0)\}$, where $m(p_0)$ is
an integer between 0 and $n$ such that

$$\sum_{j=m(p_0)+1}^{n} \binom{n}{j} p_0^j (1-p_0)^{n-j} \le \alpha < \sum_{j=m(p_0)}^{n} \binom{n}{j} p_0^j (1-p_0)^{n-j}.$$

Thus, $m(p)$ is an integer-valued, nondecreasing step-function of $p$. Define

$$\underline p = \inf\{p : m(p) \ge y\} = \inf\Big\{p : \sum_{j=y}^{n} \binom{n}{j} p^j (1-p)^{n-j} \ge \alpha\Big\}. \tag{7.5}$$

Then a level $1-\alpha$ confidence interval for $p$ is $(\underline p, 1]$ (exercise). One can
compare this confidence interval with the one obtained by applying Theorem
7.1 (exercise). See also Example 7.16.
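The lower bound in (7.5) has no closed form in general, but it is easy to compute. The sketch below is an illustration only, assuming bisection on $p$ and using the hypothetical names `binom_upper_tail` and `binomial_lower_bound`; it relies on the fact that $\sum_{j=y}^n \binom{n}{j}p^j(1-p)^{n-j}$ is increasing in $p$ when $y \ge 1$.

```python
import math

def binom_upper_tail(y, n, p):
    """P(Y >= y) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(y, n + 1))

def binomial_lower_bound(y, n, alpha, tol=1e-12):
    """inf{p : P_p(Y >= y) >= alpha}, the quantity defined in (7.5)."""
    if y == 0:
        return 0.0                  # the tail probability is 1 for every p
    lo, hi = 0.0, 1.0               # invariant: tail(lo) < alpha <= tail(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binom_upper_tail(y, n, mid) >= alpha:
            hi = mid
        else:
            lo = mid
    return hi
```

When $y = n$ the tail probability is $p^n$, so the bound equals $\alpha^{1/n}$, which provides a closed-form check.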


Example 7.8. Suppose that $X$ has the following p.d.f. in a multiparameter
exponential family: $f_{\theta,\varphi}(x) = \exp\{\theta Y(x) + \varphi^\tau U(x) - \zeta(\theta,\varphi)\}$. By
Theorem 6.4, the acceptance region of a UMPU test of size $\alpha$ for testing
$H_0 : \theta = \theta_0$ versus $H_1 : \theta > \theta_0$ or $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$ is

$$A(\theta_0) = \{(y,u) : y \le c_2(u,\theta_0)\}$$

or

$$A(\theta_0) = \{(y,u) : c_1(u,\theta_0) \le y \le c_2(u,\theta_0)\},$$

where $c_i(u,\theta)$, $i = 1,2$, are nondecreasing functions of $\theta$. Confidence intervals
for $\theta$ can then be obtained by inverting $A(\theta)$ according to Figure 7.1
with $b(\theta) = c_2(u,\theta)$ and $a(\theta) = c_1(u,\theta)$ or $a(\theta) \equiv -\infty$, for any observed $u$.

Consider more specifically the case where $X_1$ and $X_2$ are independently
distributed as the Poisson distributions $P(\lambda_1)$ and $P(\lambda_2)$, respectively, and
we need a lower confidence bound for the ratio $\rho = \lambda_2/\lambda_1$. From Example
6.11, a UMPU test of size $\alpha$ for testing $H_0 : \rho = \rho_0$ versus $H_1 : \rho > \rho_0$
has the acceptance region $A(\rho_0) = \{(y,u) : y \le c(u,\rho_0)\}$, where $c(u,\rho_0)$ is
determined by the conditional distribution of $Y = X_2$ given $U = X_1 + X_2 = u$.
Since the conditional distribution of $Y$ given $U = u$ is the binomial
distribution $Bi(\rho/(1+\rho), u)$, we can use the result in Example 7.7, i.e.,
$c(u,\rho)$ is the same as $m(p)$ in Example 7.7 with $n = u$ and $p = \rho/(1+\rho)$.
Then a level $1-\alpha$ lower confidence bound for $p$ is $\underline p$ given by (7.5) with
$n = u$. Since $\rho = p/(1-p)$ is a strictly increasing function of $p$, a level
$1-\alpha$ lower confidence bound for $\rho$ is $\underline p/(1-\underline p)$. ∎




Example 7.9. Consider the normal linear model $X = N_n(Z\beta, \sigma^2 I_n)$ and
the problem of constructing a confidence set for $\theta = L\beta$, where $L$ is an
$s \times p$ matrix of rank $s$ and all rows of $L$ are in $\mathcal{R}(Z)$. It follows from the
discussion in §6.3.2 and Exercise 74 in §6.6 that a nonrandomized UMPI
test of size $\alpha$ for $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$ has the acceptance region

$$A(\theta_0) = \{X : W(X,\theta_0) \le c_\alpha\},$$

where $c_\alpha$ is the $(1-\alpha)$th quantile of the F-distribution $F_{s,n-r}$,

$$W(X,\theta) = \frac{[\|X - Z\hat\beta(\theta)\|^2 - \|X - Z\hat\beta\|^2]/s}{\|X - Z\hat\beta\|^2/(n-r)},$$

$r$ is the rank of $Z$, $r \ge s$, $\hat\beta$ is the LSE of $\beta$ and, for each fixed $\theta$, $\hat\beta(\theta)$ is a
solution of

$$\|X - Z\hat\beta(\theta)\|^2 = \min_{\beta : L\beta = \theta} \|X - Z\beta\|^2.$$

Inverting $A(\theta)$, we obtain the following confidence set for $\theta$ with confidence
coefficient $1-\alpha$: $C(X) = \{\theta : W(X,\theta) \le c_\alpha\}$, which forms a closed ellipsoid
in $\mathcal{R}^s$. ∎


The last example concerns inverting the acceptance regions of tests in 
a nonparametric problem. 


Example 7.10. Consider the problem in Example 7.6. We now derive a
confidence interval for $\theta$ by inverting the acceptance regions of the signed
rank tests given by (6.84). Note that testing whether the c.d.f. of $X_i$ is
symmetric about $\theta$ is equivalent to testing whether the c.d.f. of $X_i - \theta$ is
symmetric about 0. Let $c_i$'s be given by (6.84), $W$ be given by (6.83),
and, for each $\theta$, let $R^o_+(\theta)$ be the vector of ordered components of $R_+(\theta)$
described in Example 7.6. A level $1-\alpha$ confidence set for $\theta$ is

$$C(X) = \{\theta : c_1 \le W(R^o_+(\theta)) \le c_2\}.$$

The region $C(X)$ can be computed numerically for any observed $X$. From
the discussion in Example 7.6, $W(R^o_+(\theta))$ is a pivotal quantity and, therefore,
$C(X)$ is the same as the confidence set obtained by using a pivotal
quantity. ∎


7.1.3 The Bayesian approach 


In Bayesian analysis, analogues to confidence sets are called credible sets.
Consider a sample $X$ from a population in a parametric family indexed by
$\theta \in \Theta \subset \mathcal{R}^k$ and dominated by a $\sigma$-finite measure. Let $f_\theta(x)$ be the p.d.f.
of $X$ and $\pi(\theta)$ be a prior p.d.f. w.r.t. a $\sigma$-finite measure $\lambda$ on $(\Theta, \mathcal{B}_\Theta)$. Let

$$p_x(\theta) = f_\theta(x)\pi(\theta)/m(x)$$

be the posterior p.d.f. w.r.t. $\lambda$, where $x$ is the observed $X$ and $m(x) =
\int_\Theta f_\theta(x)\pi(\theta)\,d\lambda$. For any $\alpha \in (0,1)$, a level $1-\alpha$ credible set for $\theta$ is any
$C \in \mathcal{B}_\Theta$ with

$$P_{\theta|x}(\theta \in C) = \int_C p_x(\theta)\,d\lambda \ge 1-\alpha. \tag{7.6}$$

A level $1-\alpha$ highest posterior density (HPD) credible set for $\theta$ is defined
to be the event

$$C(x) = \{\theta : p_x(\theta) \ge c_\alpha\}, \tag{7.7}$$

where $c_\alpha$ is chosen so that $\int_{C(x)} p_x(\theta)\,d\lambda \ge 1-\alpha$. When $p_x(\theta)$ has a
continuous c.d.f., we can replace $\ge$ in (7.6) and (7.7) by $=$. An HPD
credible set is often an interval with the shortest length among all credible
intervals of the same level (Exercise 40).


The Bayesian credible sets and the confidence sets we have discussed
so far are very different in terms of their meanings and interpretations,
although sometimes they look similar. In a credible set, $x$ is fixed and
$\theta$ is considered random and the probability statement in (7.6) is w.r.t.
the posterior probability $P_{\theta|x}$. On the other hand, in a confidence set
$\theta$ is nonrandom (although unknown) but $X$ is considered random, and
the significance level is w.r.t. $P(\theta \in C(X))$, the probability related to the
distribution of $X$. The set $C(X)$ in (7.7) is not necessarily a confidence set
with significance level $1-\alpha$.

When $\pi(\theta)$ is constant, which is usually an improper prior, the HPD
credible set $C(x)$ in (7.7) is related to the idea of maximizing likelihood (a
non-Bayesian approach introduced in §4.4; see also §7.3.2), since $p_x(\theta) =
f_\theta(x)/m(x)$ is proportional to $f_\theta(x) = \ell(\theta)$, the likelihood function. In such
a case $C(X)$ may be a confidence set with significance level $1-\alpha$.


Example 7.11. Let $X_1,...,X_n$ be i.i.d. as $N(\theta, \sigma^2)$ with an unknown $\theta \in \mathcal{R}$
and a known $\sigma^2$. Let $\pi(\theta)$ be the p.d.f. of $N(\mu_0, \sigma_0^2)$ with known $\mu_0$ and
$\sigma_0^2$. Then, $p_x(\theta)$ is the p.d.f. of $N(\mu_*(x), c^2)$ (Example 2.25), where $\mu_*(x)$
and $c^2$ are given by (2.25), and the HPD credible set in (7.7) is

$$C(x) = \big\{\theta : e^{-[\theta-\mu_*(x)]^2/(2c^2)} \ge c_\alpha\sqrt{2\pi}\,c\big\}
= \big\{\theta : |\theta - \mu_*(x)| \le \sqrt{2}\,c\,[-\log(c_\alpha\sqrt{2\pi}\,c)]^{1/2}\big\}.$$

Let $\Phi$ be the standard normal c.d.f. The quantity $\sqrt{2}\,c\,[-\log(c_\alpha\sqrt{2\pi}\,c)]^{1/2}$
must be $c z_{1-\alpha/2}$, where $z_a = \Phi^{-1}(a)$, since it is chosen so that $P_{\theta|x}(C(x)) =
1-\alpha$ and $P_{\theta|x} = N(\mu_*(x), c^2)$. Therefore,

$$C(x) = [\mu_*(x) - c z_{1-\alpha/2},\ \mu_*(x) + c z_{1-\alpha/2}].$$

If we let $\sigma_0^2 \to \infty$, which is equivalent to taking the Lebesgue measure as
the (improper) prior, then $\mu_*(x) = \bar x$, $c^2 = \sigma^2/n$, and

$$C(x) = [\bar x - \sigma z_{1-\alpha/2}/\sqrt{n},\ \bar x + \sigma z_{1-\alpha/2}/\sqrt{n}],$$

which is the same as the confidence interval in Example 2.31 for $\theta$ with
confidence coefficient $1-\alpha$. Although the Bayesian credible set coincides
with the classical confidence interval, which is frequently the case when a
noninformative prior is used, their interpretations are still different. ∎
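The HPD interval of Example 7.11 can be evaluated directly. The sketch below is illustrative only: equation (2.25) is not reproduced in this section, so the code assumes the standard conjugate-normal posterior, $\mu_*(x) = (\sigma^2\mu_0 + n\sigma_0^2\bar x)/(\sigma^2 + n\sigma_0^2)$ and $c^2 = \sigma_0^2\sigma^2/(n\sigma_0^2 + \sigma^2)$, and the function name `normal_hpd` is hypothetical.

```python
import math
from statistics import NormalDist

def normal_hpd(xbar, n, sigma2, mu0, sigma0_2, alpha):
    """HPD credible interval [mu_* - c z, mu_* + c z] for a normal mean.

    Assumes the usual conjugate-normal posterior (taken here to match (2.25)):
    posterior mean mu_* and posterior variance c^2 as computed below.
    """
    post_mean = (sigma2 * mu0 + n * sigma0_2 * xbar) / (sigma2 + n * sigma0_2)
    post_var = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{1 - alpha/2}
    half = z * math.sqrt(post_var)
    return post_mean - half, post_mean + half
```

With a very diffuse prior (large `sigma0_2`) the result is numerically indistinguishable from $\bar x \pm \sigma z_{1-\alpha/2}/\sqrt{n}$, in line with the limit $\sigma_0^2 \to \infty$ discussed in the example; with an informative prior the interval is pulled toward $\mu_0$ and is shorter.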


More details about Bayesian credible sets can be found, for example, in 
Berger (1985, §4.3). 


7.1.4 Prediction sets 


In some problems the quantity of interest is the future (or unobserved) value
of a random variable $\xi$. An inference procedure about a random quantity
instead of an unknown nonrandom parameter is called prediction. If the
distribution of $\xi$ is known, then a level $1-\alpha$ prediction set for $\xi$ is any event
$C$ satisfying $P_\xi(\xi \in C) \ge 1-\alpha$. In applications, however, the distribution
of $\xi$ is usually unknown.


Suppose that the distribution of $\xi$ is related to the distribution of
a sample $X$ from which prediction will be made. For instance, $X =
(X_1,...,X_n)$ is the observed sample and $\xi = X_{n+1}$ is to be predicted, where
$X_1,...,X_n,X_{n+1}$ are i.i.d. random variables. A set $C(X)$ depending only
on the sample $X$ is said to be a level $1-\alpha$ prediction set for $\xi$ if

$$\inf_{P\in\mathcal{P}} P(\xi \in C(X)) \ge 1-\alpha,$$

where $P$ is the joint distribution of $(\xi, X)$ and $\mathcal{P}$ contains all possible $P$.


Note that prediction sets are very similar and closely related to confidence
sets. Hence, some methods for constructing confidence sets can be
applied to obtain prediction sets. For example, if $R(X,\xi)$ is a pivotal
quantity in the sense that its distribution does not depend on $P$, then a
prediction set can be obtained by inverting $c_1 \le R(X,\xi) \le c_2$. The following
example illustrates this idea.


Example 7.12. Many prediction problems encountered in practice can
be formulated as follows. The variable $\xi$ to be predicted is related to a
vector-valued covariate $\zeta$ (called predictor) according to $E(\xi|\zeta) = \zeta^\tau\beta$,
where $\beta$ is a $p$-vector of unknown parameters. Suppose that at $\zeta = Z_i$,
we observe $\xi = X_i$, $i = 1,...,n$, and $X_i$'s are independent. Based on
$(X_1,Z_1),...,(X_n,Z_n)$, we would like to construct a prediction set for the
value of $\xi = X_0$ when $\zeta = Z_0 \in \mathcal{R}(Z)$, where $Z$ is the $n \times p$ matrix whose
$i$th row is the vector $Z_i$. The $Z_i$'s are either fixed or random observations
(in the latter case all probabilities and expectations given in the following
discussion are conditional on $Z_0, Z_1,...,Z_n$).

Assume further that $X = (X_1,...,X_n) = N_n(Z\beta, \sigma^2 I_n)$ follows a normal
linear model and is independent of $X_0 = N(Z_0^\tau\beta, \sigma^2)$. Let $\hat\beta$ be the LSE
of $\beta$, $\hat\sigma^2 = \|X - Z\hat\beta\|^2/(n-r)$, and $\|Z_0\|_Z^2 = Z_0^\tau(Z^\tau Z)^- Z_0$, where $r$ is the
rank of $Z$. Then

$$R(X, X_0) = \frac{X_0 - Z_0^\tau\hat\beta}{\hat\sigma\sqrt{1 + \|Z_0\|_Z^2}}$$

has the t-distribution $t_{n-r}$ and, therefore, is a pivotal quantity. This is
because $X_0$ and $Z_0^\tau\hat\beta$ are independently normal,

$$E(X_0 - Z_0^\tau\hat\beta) = 0, \qquad \mathrm{Var}(X_0 - Z_0^\tau\hat\beta) = \sigma^2(1 + \|Z_0\|_Z^2),$$

$(n-r)\hat\sigma^2/\sigma^2$ has the chi-square distribution $\chi^2_{n-r}$, and $X_0$, $Z_0^\tau\hat\beta$, and $\hat\sigma^2$ are
independent (Theorem 3.8). A level $1-\alpha$ prediction interval for $X_0$ is then

$$\Big[Z_0^\tau\hat\beta - t_{n-r,\alpha/2}\,\hat\sigma\sqrt{1 + \|Z_0\|_Z^2},\ Z_0^\tau\hat\beta + t_{n-r,\alpha/2}\,\hat\sigma\sqrt{1 + \|Z_0\|_Z^2}\Big], \tag{7.8}$$

where $t_{n-r,a}$ is the $(1-a)$th quantile of the t-distribution $t_{n-r}$.

To compare prediction sets with confidence sets, let us consider a confidence
interval for $E(X_0) = Z_0^\tau\beta$. Using the pivotal quantity

$$R(X, Z_0^\tau\beta) = \frac{Z_0^\tau\hat\beta - Z_0^\tau\beta}{\hat\sigma\|Z_0\|_Z},$$

we obtain the following confidence interval for $Z_0^\tau\beta$ with confidence coefficient
$1-\alpha$:

$$\big[Z_0^\tau\hat\beta - t_{n-r,\alpha/2}\,\hat\sigma\|Z_0\|_Z,\ Z_0^\tau\hat\beta + t_{n-r,\alpha/2}\,\hat\sigma\|Z_0\|_Z\big]. \tag{7.9}$$

Since a random variable is more variable than its average (an unknown parameter),
the prediction interval (7.8) is always longer than the confidence
interval (7.9), although each of them covers the quantity of interest with
probability $1-\alpha$. In fact, when $\|Z_0\|_Z^2 \to 0$ as $n \to \infty$, the length of the
confidence interval (7.9) tends to 0 a.s., whereas the length of the prediction
interval (7.8) tends to a positive constant a.s. ∎
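To make the comparison between (7.8) and (7.9) concrete, consider the simplest special case. This case is an illustrative assumption, not part of the example: $p = 1$ and $Z_0 = Z_1 = \cdots = Z_n = 1$, so that $X_0, X_1,...,X_n$ are i.i.d. $N(\beta, \sigma^2)$. Then $r = 1$, $\hat\beta = \bar X$, $\hat\sigma^2 = S^2$ (the sample variance), and $\|Z_0\|_Z^2 = 1/n$, which gives:

```latex
\begin{align*}
\text{(7.8):}\quad & \Big[\bar X - t_{n-1,\alpha/2}\,S\sqrt{1+\tfrac{1}{n}},\
                          \bar X + t_{n-1,\alpha/2}\,S\sqrt{1+\tfrac{1}{n}}\Big],\\
\text{(7.9):}\quad & \Big[\bar X - t_{n-1,\alpha/2}\,S/\sqrt{n},\
                          \bar X + t_{n-1,\alpha/2}\,S/\sqrt{n}\Big],\\
& \frac{\text{length of (7.8)}}{\text{length of (7.9)}}
   = \frac{\sqrt{1+1/n}}{1/\sqrt{n}} = \sqrt{n+1}.
\end{align*}
```

In this special case the prediction interval is longer by the factor $\sqrt{n+1}$, and its length tends to $2\sigma\Phi^{-1}(1-\alpha/2)$ a.s. while the length of the confidence interval tends to 0.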


Because of the similarity between confidence sets and prediction sets, 
in the rest of this chapter we do not discuss prediction sets in detail. Some 
examples are given in Exercises 30 and 31. 




7.2 Properties of Confidence Sets 


In this section, we study some properties of confidence sets and introduce 
several criteria for comparing them. 


7.2.1 Lengths of confidence intervals 


For confidence intervals of a real-valued $\theta$ with the same confidence
coefficient, an apparent measure of their performance is the interval length.
Shorter confidence intervals are preferred, since they are more informative. 
In most problems, however, shortest-length confidence intervals do not ex- 
ist. A common approach is to consider a reasonable class of confidence 
intervals (with the same confidence coefficient) and then find a confidence 
interval with the shortest length within the class. 


When confidence intervals are constructed by using pivotal quantities
or by inverting acceptance regions of tests, choosing a reasonable class of
confidence intervals amounts to selecting good pivotal quantities or tests.
Functions of sufficient statistics should be used, when sufficient statistics
exist. In many problems pivotal quantities or tests are related to some
point estimators of $\theta$. For example, in a location family problem (Example
7.1), a confidence interval for $\theta = \mu$ is often of the form $[\hat\theta - c, \hat\theta + c]$, where
$\hat\theta$ is an estimator of $\theta$ and $c$ is a constant. In such a case a more accurate
estimator of $\theta$ should intuitively result in a better confidence interval. For
instance, when $X_1,...,X_n$ are i.i.d. $N(\mu,1)$, it can be shown (exercise) that
the interval $[\bar X - c_1, \bar X + c_1]$ is better than the interval $[X_1 - c_2, X_1 + c_2]$ in
terms of their lengths, where $c_i$'s are chosen so that these confidence intervals
have confidence coefficient $1-\alpha$. However, we cannot have the same
conclusion when $X_i$'s are from the Cauchy distribution $C(\mu,1)$ (Exercise
32). The following is another example.


Example 7.13. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution
$U(0,\theta)$ with an unknown $\theta > 0$. A confidence interval for $\theta$ of the form
$[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$ is derived in Example 7.2, where $a$ and $b$ are constants
chosen so that this confidence interval has confidence coefficient $1-\alpha$.
Another confidence interval obtained by applying Proposition 7.1 with $T = X$
is of the form $[b_1^{-1}\tilde X, a_1^{-1}\tilde X]$, where $\tilde X = e(\prod_{i=1}^n X_i)^{1/n}$. We now argue that
when $n$ is large enough, the former has a shorter length than the latter.
Note that $\sqrt{n}(\tilde X - \theta)/\theta \to_d N(0,1)$. Thus,

$$P\Big(\big(1 + \tfrac{d}{\sqrt{n}}\big)^{-1}\tilde X \le \theta \le \big(1 + \tfrac{c}{\sqrt{n}}\big)^{-1}\tilde X\Big) = P\Big(c \le \tfrac{\sqrt{n}(\tilde X - \theta)}{\theta} \le d\Big) \to 1-\alpha$$

for some constants $c$ and $d$. This means that $a_1 \approx 1 + c/\sqrt{n}$, $b_1 \approx 1 + d/\sqrt{n}$,
and the length of $[b_1^{-1}\tilde X, a_1^{-1}\tilde X]$ converges to 0 a.s. at the rate $n^{-1/2}$. On
the other hand,

$$P\Big(\big(1 + \tfrac{d}{n}\big)^{-1}X_{(n)} \le \theta \le \big(1 + \tfrac{c}{n}\big)^{-1}X_{(n)}\Big) = P\Big(c \le \tfrac{n(X_{(n)} - \theta)}{\theta} \le d\Big) \to 1-\alpha$$

for some constants $c$ and $d$, since $n(X_{(n)} - \theta)/\theta$ has a known limiting distribution
(Example 2.34). This means that the length of $[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$
converges to 0 a.s. at the rate $n^{-1}$ and, therefore, $[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$ is
shorter than $[b_1^{-1}\tilde X, a_1^{-1}\tilde X]$ for sufficiently large $n$ a.s.


Similarly, one can show that the confidence interval based on the pivotal
quantity $\bar X/\theta$ is not as good as $[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$ in terms of their lengths.

Thus, it is reasonable to consider the class of confidence intervals of the
form $[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$ subject to $P(b^{-1}X_{(n)} \le \theta \le a^{-1}X_{(n)}) = 1-\alpha$.
The shortest-length interval within this class can be derived as follows.
Note that $X_{(n)}/\theta$ has the Lebesgue p.d.f. $nx^{n-1}I_{(0,1)}(x)$. Hence

$$1-\alpha = P(b^{-1}X_{(n)} \le \theta \le a^{-1}X_{(n)}) = \int_a^b nx^{n-1}\,dx = b^n - a^n.$$

This implies that $1 \ge b > a \ge 0$ and $\frac{da}{db} = \big(\frac{b}{a}\big)^{n-1}$. Since the length of the
interval $[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$ is $\psi(a,b) = X_{(n)}(a^{-1} - b^{-1})$,

$$\frac{d\psi}{db} = X_{(n)}\Big(\frac{1}{b^2} - \frac{1}{a^2}\frac{da}{db}\Big) = X_{(n)}\,\frac{a^{n+1} - b^{n+1}}{b^2a^{n+1}} < 0.$$

Hence the minimum occurs at $b = 1$ ($a = \alpha^{1/n}$). This shows that the
shortest-length interval is $[X_{(n)}, \alpha^{-1/n}X_{(n)}]$. ∎
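The constraint $b^n - a^n = 1-\alpha$ and the resulting lengths can be tabulated directly. The following sketch is only an illustration: it parameterizes the class by the lower-tail mass $a^n$ (called `tail_low` here, a hypothetical name, as are the function names), so that `tail_low = alpha` gives the shortest interval with $b = 1$ and `tail_low = alpha / 2` gives the equal-tail interval.

```python
def uniform_interval_factors(n, alpha, tail_low):
    """Return (a, b) with b**n - a**n = 1 - alpha.

    tail_low = a**n = P(X_(n)/theta < a) must lie in (0, alpha];
    tail_low = alpha gives b = 1, the shortest interval of Example 7.13.
    """
    a = tail_low ** (1.0 / n)
    b = (1.0 - alpha + tail_low) ** (1.0 / n)
    return a, b

def relative_length(a, b):
    """Length of [X_(n)/b, X_(n)/a] divided by X_(n)."""
    return 1.0 / a - 1.0 / b

# Compare the shortest and the equal-tail choices for n = 5, alpha = 0.05.
shortest = relative_length(*uniform_interval_factors(5, 0.05, 0.05))
equal_tail = relative_length(*uniform_interval_factors(5, 0.05, 0.025))
```

Moving `tail_low` from `alpha / 2` up to `alpha` increases $b$ toward 1 and decreases the relative length, in agreement with $d\psi/db < 0$.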


As Example 7.13 indicates, once a reasonable class of confidence inter- 
vals is chosen (using some good estimators, pivotal quantities, or tests), we 
may find the shortest-length confidence interval within the class by directly 
analyzing the lengths of the intervals. For a large class of problems, the 
following result can be used. 


Theorem 7.3. Let $\theta$ be a real-valued parameter and $T(X)$ be a real-valued
statistic.

(i) Let $U(X)$ be a positive statistic. Suppose that $(T - \theta)/U$ is a pivotal
quantity having a Lebesgue p.d.f. $f$ that is unimodal at $x_0 \in \mathcal{R}$ in the sense
that $f(x)$ is nondecreasing for $x \le x_0$ and $f(x)$ is nonincreasing for $x \ge x_0$.
Consider the following class of confidence intervals for $\theta$:

$$\mathcal{C} = \Big\{[T - bU, T - aU] : a \in \mathcal{R},\ b \in \mathcal{R},\ \int_a^b f(x)\,dx = 1-\alpha\Big\}. \tag{7.10}$$

If $[T - b_*U, T - a_*U] \in \mathcal{C}$, $f(a_*) = f(b_*) > 0$, and $a_* \le x_0 \le b_*$, then the
interval $[T - b_*U, T - a_*U]$ has the shortest length within $\mathcal{C}$.

(ii) Suppose that $T > 0$, $\theta > 0$, $T/\theta$ is a pivotal quantity having a Lebesgue
p.d.f. $f$, and that $x^2f(x)$ is unimodal at $x_0$. Consider the following class of
confidence intervals for $\theta$:

$$\mathcal{C} = \Big\{[b^{-1}T, a^{-1}T] : a > 0,\ b > 0,\ \int_a^b f(x)\,dx = 1-\alpha\Big\}. \tag{7.11}$$

If $[b_*^{-1}T, a_*^{-1}T] \in \mathcal{C}$, $a_*^2f(a_*) = b_*^2f(b_*) > 0$, and $a_* \le x_0 \le b_*$, then the
interval $[b_*^{-1}T, a_*^{-1}T]$ has the shortest length within $\mathcal{C}$.

Proof. We prove (i) only. The proof of (ii) is left as an exercise. Note that
the length of an interval in $\mathcal{C}$ is $(b-a)U$. Thus, it suffices to show that if
$a < b$ and $b - a < b_* - a_*$, then $\int_a^b f(x)\,dx < 1-\alpha$. Assume that $a < b$,
$b - a < b_* - a_*$, and $a \le a_*$ (the proof for $a > a_*$ is similar).

If $b \le a_*$, then $a < b \le a_* \le x_0$ and

$$\int_a^b f(x)\,dx \le f(a_*)(b-a) < f(a_*)(b_* - a_*) \le \int_{a_*}^{b_*} f(x)\,dx = 1-\alpha,$$

where the first inequality follows from the unimodality of $f$, the strict inequality
follows from $b - a < b_* - a_*$ and $f(a_*) > 0$, and the last inequality
follows from the unimodality of $f$ and the fact that $f(a_*) = f(b_*)$.

If $b > a_*$, then $a \le a_* < b < b_*$. By the unimodality of $f$,

$$\int_a^{a_*} f(x)\,dx \le f(a_*)(a_* - a) \quad\text{and}\quad \int_b^{b_*} f(x)\,dx \ge f(b_*)(b_* - b).$$

Then

$$\begin{aligned}
\int_a^b f(x)\,dx &= \int_{a_*}^{b_*} f(x)\,dx + \int_a^{a_*} f(x)\,dx - \int_b^{b_*} f(x)\,dx \\
&= 1-\alpha + \int_a^{a_*} f(x)\,dx - \int_b^{b_*} f(x)\,dx \\
&\le 1-\alpha + f(a_*)(a_* - a) - f(b_*)(b_* - b) \\
&= 1-\alpha + f(a_*)[(a_* - a) - (b_* - b)] \\
&= 1-\alpha + f(a_*)[(b - a) - (b_* - a_*)] \\
&< 1-\alpha. \qquad ∎
\end{aligned}$$


Example 7.14. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with unknown $\mu$ and
$\sigma^2$. Confidence intervals for $\theta = \mu$ using the pivotal quantity $\sqrt{n}(\bar X - \mu)/S$
form the class $\mathcal{C}$ in (7.10) with $f$ being the p.d.f. of the t-distribution $t_{n-1}$,
which is unimodal at $x_0 = 0$. Hence, we can apply Theorem 7.3(i). Since $f$
is symmetric about 0, $f(a_*) = f(b_*)$ implies $a_* = -b_*$ (exercise). Therefore,
the equal-tail confidence interval

$$\big[\bar X - t_{n-1,\alpha/2}S/\sqrt{n},\ \bar X + t_{n-1,\alpha/2}S/\sqrt{n}\big] \tag{7.12}$$

has the shortest length within $\mathcal{C}$.

If $\theta = \mu$ and $\sigma^2$ is known, then we can replace $S$ by $\sigma$ and $f$ by the
standard normal p.d.f. (i.e., use the pivotal quantity $\sqrt{n}(\bar X - \mu)/\sigma$ instead
of $\sqrt{n}(\bar X - \mu)/S$). The resulting confidence interval is

$$\big[\bar X - \Phi^{-1}(1-\alpha/2)\sigma/\sqrt{n},\ \bar X + \Phi^{-1}(1-\alpha/2)\sigma/\sqrt{n}\big], \tag{7.13}$$

which is the shortest interval of the form $[\bar X - b, \bar X - a]$ with confidence
coefficient $1-\alpha$. The difference in length of the intervals in (7.12) and (7.13)
is a random variable so that we cannot tell which one is better in general.
But the expected length of the interval (7.13) is always shorter than that of
the interval (7.12) (exercise). This again shows the importance of picking
the right pivotal quantity.

Consider next confidence intervals for $\theta = \sigma^2$ using the pivotal quantity
$(n-1)S^2/\sigma^2$, which form the class $\mathcal{C}$ in (7.11) with $f$ being the p.d.f. of
the chi-square distribution $\chi^2_{n-1}$. Note that $x^2f(x)$ is unimodal, but not
symmetric. By Theorem 7.3(ii), the shortest-length interval within $\mathcal{C}$ is

$$\big[b_*^{-1}(n-1)S^2,\ a_*^{-1}(n-1)S^2\big], \tag{7.14}$$

where $a_*$ and $b_*$ are solutions of $a_*^2f(a_*) = b_*^2f(b_*)$ and $\int_{a_*}^{b_*} f(x)\,dx = 1-\alpha$.
Numerical values of $a_*$ and $b_*$ can be obtained (Tate and Klett, 1959). Note
that this interval is not equal-tail.

If $\theta = \sigma^2$ and $\mu$ is known, then a better pivotal quantity is $T/\sigma^2$, where
$T = \sum_{i=1}^n (X_i - \mu)^2$. One can show (exercise) that if we replace $(n-1)S^2$ by
$T$ and $f$ by the p.d.f. of the chi-square distribution $\chi^2_n$, then the resulting
interval has shorter expected length than that of the interval in (7.14).

Suppose that we need a confidence interval for $\theta = \sigma$ when $\mu$ is unknown.
Consider the class of confidence intervals

$$\big[b^{-1/2}\sqrt{n-1}\,S,\ a^{-1/2}\sqrt{n-1}\,S\big]$$

with $\int_a^b f(x)\,dx = 1-\alpha$ and $f$ being the p.d.f. of $\chi^2_{n-1}$. The shortest-length
interval, however, is not the one with the endpoints equal to the square
roots of the endpoints of the interval (7.14) (Exercise 36(c)). ∎
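The claim of Theorem 7.3(i) behind (7.13) can be checked numerically for the standard normal pivot. The sketch below is illustrative: it indexes the class (7.10) by the lower-tail mass $\Phi(a)$ (named `lower_tail` here, a hypothetical parameter, as are the function names) and compares $b - a$ across members of the class; only the symmetric choice satisfies $f(a) = f(b)$.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal pivot sqrt(n)(Xbar - mu)/sigma

def interval_endpoints(alpha, lower_tail):
    """(a, b) with Phi(b) - Phi(a) = 1 - alpha and Phi(a) = lower_tail,
    where 0 < lower_tail < alpha."""
    a = STD.inv_cdf(lower_tail)
    b = STD.inv_cdf(1.0 - alpha + lower_tail)
    return a, b

def pivot_length(alpha, lower_tail):
    """b - a; the interval [Xbar - b*s, Xbar - a*s] has length (b - a)*s."""
    a, b = interval_endpoints(alpha, lower_tail)
    return b - a

symmetric = pivot_length(0.05, 0.025)   # equal-tail choice, as in (7.13)
skewed_lo = pivot_length(0.05, 0.010)   # more mass cut from the upper tail
skewed_hi = pivot_length(0.05, 0.040)   # more mass cut from the lower tail
```

Both skewed choices give a larger $b - a$ than the equal-tail choice, as the theorem predicts for a symmetric unimodal density.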


Note that Theorem 7.3(ii) cannot be applied to obtain the result in
Example 7.13 unless $n = 1$, since the p.d.f. of $X_{(n)}/\theta$ is strictly increasing
when $n > 1$. A result similar to Theorem 7.3, which can be applied to
Example 7.13, is given in Exercise 38.


The result in Theorem 7.3 can also be applied to justify the idea of HPD 
credible sets in Bayesian analysis (Exercise 40). 

If a confidence interval has the shortest length within a class of confidence
intervals, then its expected length is also the shortest within the
same class, provided that its expected length is finite. In a problem where
a shortest-length confidence interval does not exist, we may have to use
the expected length as the criterion in comparing confidence intervals. For
instance, the expected length of the interval in (7.13) is always shorter than
that of the interval in (7.12), whereas the probability that the interval in
(7.12) is shorter than the interval in (7.13) is positive for any fixed $n$.
Another example is the interval $[X_{(n)}, \alpha^{-1/n}X_{(n)}]$ in Example 7.13. Although
we are not able to say that this interval has the shortest length among all
confidence intervals for $\theta$ with confidence coefficient $1-\alpha$, we can show
that it has the shortest expected length, using the results in Theorems 7.4
and 7.6 (§7.2.2).


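For one-sided confidence intervals (confidence bounds) of a real-valued
$\theta$, their lengths may be infinity. We can use the distance between the
confidence bound and $\theta$ as a criterion in comparing confidence bounds,
which is equivalent to comparing the tightness of confidence bounds. Let
$\underline\theta_j$, $j = 1,2$, be two lower confidence bounds for $\theta$ with the same confidence
coefficient. If $\underline\theta_1 - \theta \ge \underline\theta_2 - \theta$ is always true, then $\underline\theta_1 \ge \underline\theta_2$ and $\underline\theta_1$ is tighter
(more informative) than $\underline\theta_2$. Again, since $\underline\theta_j$ are random, we may have to
consider $E(\underline\theta_j - \theta)$ and choose $\underline\theta_1$ if $E(\underline\theta_1) \ge E(\underline\theta_2)$. As a specific example,
consider i.i.d. $X_1,...,X_n$ from $N(\theta,1)$. If we use the pivotal quantity $\bar X - \theta$,
then $\underline\theta_1 = \bar X - \Phi^{-1}(1-\alpha)/\sqrt{n}$. If we use the pivotal quantity $X_1 - \theta$, then
$\underline\theta_2 = X_1 - \Phi^{-1}(1-\alpha)$. Clearly $E(\underline\theta_1) \ge E(\underline\theta_2)$. Although $\underline\theta_1$ is intuitively
preferred, $\underline\theta_1 < \underline\theta_2$ with a positive probability for any fixed $n > 1$.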
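The positive probability mentioned at the end of the paragraph can be computed in closed form for the normal example. The sketch below is illustrative (the function name is hypothetical): since $\underline\theta_1 - \underline\theta_2 = (\bar X - X_1) + \Phi^{-1}(1-\alpha)(1 - 1/\sqrt{n})$ and $\bar X - X_1$ is $N(0, 1 - 1/n)$, the event $\underline\theta_1 < \underline\theta_2$ has probability $\Phi\big(-\Phi^{-1}(1-\alpha)(1 - 1/\sqrt{n})/\sqrt{1 - 1/n}\big)$.

```python
import math
from statistics import NormalDist

def prob_first_bound_smaller(n, alpha):
    """P(theta_1 < theta_2) for i.i.d. N(theta, 1) data with n > 1, where
    theta_1 = Xbar - z/sqrt(n), theta_2 = X_1 - z, z = Phi^{-1}(1 - alpha)."""
    std = NormalDist()
    z = std.inv_cdf(1.0 - alpha)
    shift = z * (1.0 - 1.0 / math.sqrt(n))   # E(theta_1) - E(theta_2) >= 0
    sd = math.sqrt(1.0 - 1.0 / n)            # standard deviation of Xbar - X_1
    return std.cdf(-shift / sd)

p4 = prob_first_bound_smaller(4, 0.05)       # roughly 0.17
p100 = prob_first_bound_smaller(100, 0.05)   # smaller, but still positive
```

The probability does not depend on $\theta$ and stays strictly between 0 and 1/2 for every fixed $n > 1$, which is why a comparison through expectations is needed.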

Some ideas discussed previously can be extended to the comparison of
confidence sets for multivariate $\theta$. For bounded confidence sets in $\mathcal{R}^k$, for
example, we may consider their volumes (Lebesgue measures). However, in
multivariate cases it is difficult to compare the volumes of confidence sets
with different shapes. Some results about expected volumes of confidence
sets are given in Theorem 7.6.


7.2.2 UMA and UMAU confidence sets 


For a confidence set obtained by inverting the acceptance regions of some 
UMP or UMPU tests, it is expected that the confidence set inherits some 
optimality property. 


Definition 7.2. Let $\theta \in \Theta$ be an unknown parameter and $\Theta'$ be a subset
of $\Theta$ that does not contain the true parameter value $\theta$. A confidence set
$C(X)$ for $\theta$ with confidence coefficient $1-\alpha$ is said to be $\Theta'$-uniformly most
accurate (UMA) if and only if for any other confidence set $C_1(X)$ with
significance level $1-\alpha$,

$$P(\theta' \in C(X)) \le P(\theta' \in C_1(X)) \quad\text{for all } \theta' \in \Theta'. \tag{7.15}$$

$C(X)$ is UMA if and only if it is $\Theta'$-UMA with $\Theta' = \{\theta\}^c$. ∎


The probabilities in (7.15) are probabilities of covering false values.
Intuitively, confidence sets with small probabilities of covering wrong parameter
values are preferred. The reason why we sometimes need to consider a
$\Theta'$ different from $\{\theta\}^c$ (the set containing all false values) is that for some
confidence sets, such as one-sided confidence intervals, we do not need to
worry about the probabilities of covering some false values. For example,
if we consider a lower confidence bound for a real-valued $\theta$, we are asserting
that $\theta$ is larger than a certain value and we only need to worry about
covering values of $\theta$ that are too small. Thus, $\Theta' = \{\theta' \in \Theta : \theta' < \theta\}$. A
similar discussion leads to the consideration of $\Theta' = \{\theta' \in \Theta : \theta' > \theta\}$ for
upper confidence bounds.


Theorem 7.4. Let $C(X)$ be a confidence set for $\theta$ obtained by inverting the
acceptance regions of nonrandomized tests $T_{\theta_0}$ for testing $H_0 : \theta = \theta_0$ versus
$H_1 : \theta \in \Theta_{\theta_0}$. Suppose that for each $\theta_0$, $T_{\theta_0}$ is UMP of size $\alpha$. Then $C(X)$
is $\Theta'$-UMA with confidence coefficient $1-\alpha$, where $\Theta' = \{\theta' : \theta \in \Theta_{\theta'}\}$.

Proof. The fact that $C(X)$ has confidence coefficient $1-\alpha$ follows from
Theorem 7.2. Let $C_1(X)$ be another confidence set with significance level
$1-\alpha$. By Proposition 7.2, the test $T_{1\theta_0}(X) = 1 - I_{A_1(\theta_0)}(X)$ with $A_1(\theta_0) =
\{x : \theta_0 \in C_1(x)\}$ has significance level $\alpha$ for testing $H_0 : \theta = \theta_0$ versus
$H_1 : \theta \in \Theta_{\theta_0}$. For any $\theta' \in \Theta'$, $\theta \in \Theta_{\theta'}$ and, hence, the population $P$ is in
the family defined by $H_1 : \theta \in \Theta_{\theta'}$. Thus,

$$\begin{aligned}
P(\theta' \in C(X)) &= 1 - P(T_{\theta'}(X) = 1) \\
&\le 1 - P(T_{1\theta'}(X) = 1) \\
&= P(\theta' \in C_1(X)),
\end{aligned}$$

where the first equality follows from the fact that $T_{\theta'}$ is nonrandomized and
the inequality follows from the fact that $T_{\theta'}$ is UMP. ∎


Theorem 7.4 can be applied to construct UMA confidence bounds in
problems where the population is in a one-parameter parametric family
with monotone likelihood ratio so that UMP tests exist (Theorem 6.2).
It can also be applied to a few cases to construct two-sided UMA confidence
intervals. For example, the confidence interval $[X_{(n)}, \alpha^{-1/n}X_{(n)}]$ in
Example 7.13 is UMA (exercise).


As we discussed in §6.2, in many problems there are UMPU tests but 
not UMP tests. This leads to the following definition. 


Definition 7.3. Let $\theta \in \Theta$ be an unknown parameter, $\Theta'$ be a subset of
$\Theta$ that does not contain the true parameter value $\theta$, and $1-\alpha$ be a given
significance level.

(i) A level $1-\alpha$ confidence set $C(X)$ is said to be $\Theta'$-unbiased (unbiased
when $\Theta' = \{\theta\}^c$) if and only if $P(\theta' \in C(X)) \le 1-\alpha$ for all $\theta' \in \Theta'$.

(ii) Let $C(X)$ be a $\Theta'$-unbiased confidence set with confidence coefficient
$1-\alpha$. If (7.15) holds for any other $\Theta'$-unbiased confidence set $C_1(X)$
with significance level $1-\alpha$, then $C(X)$ is $\Theta'$-uniformly most accurate
unbiased (UMAU). $C(X)$ is UMAU if and only if it is $\Theta'$-UMAU with
$\Theta' = \{\theta\}^c$. ∎


Theorem 7.5. Let $C(X)$ be a confidence set for $\theta$ obtained by inverting
the acceptance regions of nonrandomized tests $T_{\theta_0}$ for testing $H_0 : \theta = \theta_0$
versus $H_1 : \theta \in \Theta_{\theta_0}$. If $T_{\theta_0}$ is unbiased of size $\alpha$ for each $\theta_0$, then $C(X)$ is
$\Theta'$-unbiased with confidence coefficient $1-\alpha$, where $\Theta' = \{\theta' : \theta \in \Theta_{\theta'}\}$; if
$T_{\theta_0}$ is also UMPU for each $\theta_0$, then $C(X)$ is $\Theta'$-UMAU. ∎


The proof of Theorem 7.5 is very similar to that of Theorem 7.4. 

It follows from Theorem 7.5 and the results in §6.2 that the confidence 
intervals in (7.12), (7.13), and (7.14) are UMAU, since they can be obtained 
by inverting acceptance regions of UMPU tests (Exercise 23). 


Example 7.15. Consider the normal linear model in Example 7.9 and the
parameter $\theta = l^\tau\beta$, where $l \in \mathcal{R}(Z)$. From §6.2.3, the nonrandomized test
with acceptance region

$$
A(\theta_0) = \Big\{x : l^\tau\hat\beta - \theta_0 > t_{n-r,\alpha}\sqrt{l^\tau(Z^\tau Z)^- l\,SSR/(n-r)}\Big\}
$$

is UMPU with size $\alpha$ for testing $H_0: \theta = \theta_0$ versus $H_1: \theta < \theta_0$, where $\hat\beta$ is
the LSE of $\beta$ and $t_{n-r,\alpha}$ is the $(1-\alpha)$th quantile of the t-distribution $t_{n-r}$.
Inverting $A(\theta)$ we obtain the following $\Theta'$-UMAU upper confidence bound
with confidence coefficient $1-\alpha$ and $\Theta' = (\theta, \infty)$:

$$
\bar\theta = l^\tau\hat\beta - t_{n-r,\alpha}\sqrt{l^\tau(Z^\tau Z)^- l\,SSR/(n-r)}.
$$

A UMAU confidence interval for $\theta$ can be similarly obtained.

If $\theta = L\beta$ with $L$ described in Example 7.9 and $s > 1$, then $\theta$ is
multivariate. It can be shown that the confidence set derived in Example 7.9 is
unbiased (exercise), but it may not be UMAU.


The volume of a confidence set $C(X)$ for $\theta \in R^k$ when $X = x$ is defined
to be $\mathrm{vol}(C(x)) = \int_{C(x)} d\theta'$, which is the Lebesgue measure of the set
$C(x)$ and may be infinite. In particular, if $\theta$ is real-valued and $C(X) =
[\underline\theta(X), \bar\theta(X)]$ is a confidence interval, then $\mathrm{vol}(C(x))$ is simply the length of
$C(x)$. The next result reveals a relationship between the expected volume
(length) and the probability of covering a false value of a confidence set
(interval).




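Theorem 7.6 (Pratt's theorem). Let $X$ be a sample from $P$ and $C(X)$ be
a confidence set for $\theta \in R^k$. Suppose that $\mathrm{vol}(C(x)) = \int_{C(x)} d\theta'$ is finite
a.s. $P$. Then the expected volume of $C(X)$ is

$$
E[\mathrm{vol}(C(X))] = \int_{\theta' \ne \theta} P(\theta' \in C(X))\,d\theta'. \tag{7.16}
$$

Proof. By Fubini's theorem,

$$
\begin{aligned}
E[\mathrm{vol}(C(X))] &= \int \mathrm{vol}(C(x))\,dP(x) \\
&= \int\left[\int_{C(x)} d\theta'\right]dP(x) \\
&= \int\!\!\int_{\theta' \in C(x)} d\theta'\,dP(x) \\
&= \int\left[\int_{\theta' \in C(x)} dP(x)\right]d\theta' \\
&= \int P(\theta' \in C(X))\,d\theta' \\
&= \int_{\theta' \ne \theta} P(\theta' \in C(X))\,d\theta'.
\end{aligned}
$$

This proves the result.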
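As a numerical illustration of (7.16), the following sketch checks the identity for one specific case that is assumed here for convenience: the two-sided z-interval for a normal mean with known standard deviation, whose length is nonrandom. The function names, the integration span, and the step count are illustrative choices, not part of the text.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def false_coverage(theta_prime, theta, sigma, n, z):
    """P(theta' in C(X)) for C(X) = [mean - z*sigma/sqrt(n), mean + z*sigma/sqrt(n)]
    when the true mean is theta."""
    shift = math.sqrt(n) * (theta_prime - theta) / sigma
    # theta' is covered iff shift - z <= Z <= shift + z, with Z standard normal.
    return STD_NORMAL.cdf(shift + z) - STD_NORMAL.cdf(shift - z)

def integrated_false_coverage(theta, sigma, n, alpha, span=12.0, steps=20000):
    """Trapezoidal approximation of the right-hand side of (7.16)."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    scale = sigma / math.sqrt(n)
    lo, hi = theta - span * scale, theta + span * scale
    h = (hi - lo) / steps
    total = 0.5 * (false_coverage(lo, theta, sigma, n, z)
                   + false_coverage(hi, theta, sigma, n, z))
    total += sum(false_coverage(lo + i * h, theta, sigma, n, z)
                 for i in range(1, steps))
    return total * h

def expected_length(sigma, n, alpha):
    """Left-hand side of (7.16): the (nonrandom) length of the z-interval."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * z * sigma / math.sqrt(n)
```

Excluding the single point $\theta' = \theta$ does not change the integral, which is why the code integrates over the whole range.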


It follows from Theorem 7.6 that if $C(X)$ is UMA (or UMAU) with
confidence coefficient $1-\alpha$, then it has the smallest expected volume among
all confidence sets (or all unbiased confidence sets) with significance level
$1-\alpha$. For example, the confidence interval (7.13) in Example 7.14 (when $\sigma^2$
is known) or $[X_{(n)}, \alpha^{-1/n}X_{(n)}]$ in Example 7.13 has the shortest expected
length among all confidence intervals with significance level $1-\alpha$; the
confidence interval (7.12) or (7.14) has the shortest expected length among
all unbiased confidence intervals with significance level $1-\alpha$.


7.2.3 Randomized confidence sets 


Applications of Theorems 7.4 and 7.5 require that C(X) be obtained by 
inverting acceptance regions of nonrandomized tests. Thus, these results 
cannot be directly applied to discrete problems. In fact, in discrete prob- 
lems inverting acceptance regions of randomized tests may not lead to a 
confidence set with a given confidence coefficient. Note that randomization 
is used in hypothesis testing to obtain tests with a given size. Thus, the 
same idea can be applied to confidence sets, i.e., we may consider random- 
ized confidence sets. 




Suppose that we invert acceptance regions of randomized tests $T_{\theta_0}$ that
reject $H_0: \theta = \theta_0$ with probability $T_{\theta_0}(x)$ when $X = x$. Let $U$ be a random
variable that is independent of $X$ and has the uniform distribution $U(0,1)$.
Then the test $\tilde T_{\theta_0}(X,U) = I_{(U,1]}(T_{\theta_0})$ has the same power function as $T_{\theta_0}$
and is "nonrandomized" if $U$ is viewed as part of the sample. Let

$$
A_U(\theta_0) = \{(x, U) : U \ge T_{\theta_0}(x)\}
$$

be the acceptance region of $\tilde T_{\theta_0}(X,U)$. If $T_{\theta_0}$ has size $\alpha$ for all $\theta_0$, then
inverting $A_U(\theta)$ we obtain a confidence set

$$
C(X,U) = \{\theta : (X,U) \in A_U(\theta)\}
$$

having confidence coefficient $1-\alpha$, since

$$
P(\theta \in C(X,U)) = E[P(U \ge T_\theta(X) \mid X)] = E[1 - T_\theta(X)] = 1 - \alpha.
$$

If $T_{\theta_0}$ is UMP (or UMPU) for each $\theta_0$, then $C(X,U)$ is UMA (or UMAU).
However, $C(X,U)$ is a randomized confidence set since it is still random
when we observe $X = x$.


When $T_{\theta_0}$ is a function of an integer-valued statistic, we can use the
method in the following example to derive $C(X,U)$.


Example 7.16. Let $X_1,...,X_n$ be i.i.d. binary random variables with $p =
P(X_i = 1)$. The confidence coefficient of $(\underline{p}, 1]$ may not be $1-\alpha$, where $\underline{p}$
is given by (7.5).

From Example 6.2 and the previous discussion, a randomized UMP test
for testing $H_0: p = p_0$ versus $H_1: p > p_0$ can be constructed based on
$Y = \sum_{i=1}^n X_i$ and $U$, a random variable that is independent of $Y$ and has
the uniform distribution $U(0,1)$. Since $Y$ is integer-valued and $U \in (0,1)$,
$W = Y + U$ is equivalent to $(Y,U)$. It can be shown (exercise) that $W$ has
the following Lebesgue p.d.f.:

$$
f_p(w) = \binom{n}{[w]} p^{[w]}(1-p)^{n-[w]} I_{(0,n+1)}(w), \tag{7.17}
$$

where $[w]$ is the integer part of $w$, and that the family $\{f_p : p \in (0,1)\}$ has
monotone likelihood ratio in $W$. It follows from Theorem 6.2 that the test
$\tilde T_{p_0}(Y,U) = I_{(c(p_0),\,n+1)}(W)$ is UMP of size $\alpha$ for testing $H_0: p = p_0$ versus
$H_1: p > p_0$, where $\alpha = \int_{c(p_0)}^{n+1} f_{p_0}(w)\,dw$. Since $\int_W^{n+1} f_p(w)\,dw$ is increasing
in $p$ (Lemma 6.3), inverting the acceptance regions of $\tilde T_{p_0}(Y,U)$ leads to
$C(X,U) = \{p : \int_{Y+U}^{n+1} f_p(w)\,dw \ge \alpha\} = [\underline{p}_1, 1]$, where $\underline{p}_1$ is the solution of

$$
\int_{Y+U}^{n+1} \binom{n}{[w]} p^{[w]}(1-p)^{n-[w]}\,dw = \alpha
$$




($\underline{p}_1 = 0$ if $Y = 0$ and $U < 1-\alpha$; $\underline{p}_1 = 1$ if $Y = n$ and $U > 1-\alpha$). The
lower confidence bound $\underline{p}_1$ has confidence coefficient $1-\alpha$ and is $\Theta'$-UMA
with $\Theta' = (0, p)$.
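The randomized lower bound of Example 7.16 can be computed numerically. The sketch below is one possible implementation under stated assumptions: it uses the fact that, for $W = Y + U$, the integral of $f_p$ over $(Y+U, n+1)$ equals $(1-U)P_p(Y^* = Y) + P_p(Y^* > Y)$ with $Y^*$ binomial, and it solves for $p$ by bisection. The function names and the tolerance are hypothetical.

```python
from math import comb

def upper_tail(p, y, u, n):
    """Integral of f_p(w) in (7.17) over (y + u, n + 1)."""
    def pmf(j):
        return comb(n, j) * p**j * (1 - p)**(n - j)
    return (1 - u) * pmf(y) + sum(pmf(j) for j in range(y + 1, n + 1))

def randomized_lower_bound(y, u, n, alpha, tol=1e-12):
    """Solve upper_tail(p) = alpha for p; the tail is increasing in p."""
    if upper_tail(0.0, y, u, n) >= alpha:   # happens only if y == 0 and u <= 1 - alpha
        return 0.0
    if upper_tail(1.0, y, u, n) < alpha:    # happens only if y == n and u > 1 - alpha
        return 1.0
    lo, hi = 0.0, 1.0                       # tail(lo) < alpha <= tail(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(mid, y, u, n) >= alpha:
            hi = mid
        else:
            lo = mid
    return hi
```

In practice `u` would be a single draw from the uniform distribution on (0, 1), generated independently of the data; the bound therefore changes from draw to draw even for fixed `y`.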


Using a randomized confidence set, we can achieve the purpose of ob- 
taining a confidence set with a given confidence coefficient as well as some 
optimality properties such as UMA, UMAU, or shortest expected length.
On the other hand, randomization may not be desired in practical problems. 


7.2.4 Invariant confidence sets 


Let $C(X)$ be a confidence set for $\theta$ and $g$ be a one-to-one transformation of
$X$. The invariance principle requires that $C(x)$ change in a specified way
when $x$ is transformed to $g(x)$, where $x \in \mathcal{X}$ and $\mathcal{X}$ is the range of $X$.


Definition 7.4. Let $\mathcal{G}$ be a group of one-to-one transformations of $X$ such
that $\mathcal{P}$ is invariant under $\mathcal{G}$ (Definition 2.9). Let $\theta = \theta(P)$ be a parameter
with range $\Theta$. Assume that $\bar g(\theta) = \theta(P_{g(X)})$ is well defined for any $g \in \mathcal{G}$,
i.e., $\bar g$ is a transformation on $\Theta$ induced by $g$ ($\bar g$ coincides with the $\bar g$ given
in Definition 2.9 if $\mathcal{P}$ is indexed by $\theta$).

(i) A confidence set $C(X)$ is invariant under $\mathcal{G}$ if and only if $\theta \in C(x)$ is
equivalent to $\bar g(\theta) \in C(g(x))$ for every $x \in \mathcal{X}$, $\theta \in \Theta$, and $g \in \mathcal{G}$.

(ii) $C(X)$ is $\Theta'$-uniformly most accurate invariant (UMAI) with confidence
coefficient $1-\alpha$ if and only if $C(X)$ is invariant with confidence coefficient
$1-\alpha$ and (7.15) holds for any other invariant confidence set $C_1(X)$ with
significance level $1-\alpha$. $C(X)$ is UMAI if and only if it is $\Theta'$-UMAI with
$\Theta' = \{\theta\}^c$.


Example 7.17. Consider the confidence intervals in Example 7.14. Let
$\mathcal{G} = \{g_{r,c} : r > 0, c \in R\}$ with $g_{r,c}(x) = (rx_1 + c, ..., rx_n + c)$. Let $\theta = \mu$.
Then $\bar g_{r,c}(\mu, \sigma^2) = (r\mu + c, r^2\sigma^2)$ and $\bar g(\mu) = r\mu + c$. Clearly, confidence
interval (7.12) is invariant under $\mathcal{G}$.

When $\sigma^2$ is known, the family $\mathcal{P}$ is not invariant under $\mathcal{G}$ and we consider
$\mathcal{G}_1 = \{g_{1,c} : c \in R\}$. Then both confidence intervals (7.12) and (7.13) are
invariant under $\mathcal{G}_1$.

Suppose now that $\theta = \sigma^2$. For $g_{r,c} \in \mathcal{G}$, $\bar g(\sigma^2) = r^2\sigma^2$. Hence confidence
interval (7.14) is invariant under $\mathcal{G}$.
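The invariance of the t-interval under location-scale transformations can be checked numerically. The sketch below assumes that (7.12) is the usual interval with endpoints $\bar X \pm t_{n-1,\alpha/2} S/\sqrt{n}$ (the form that appears in Example 7.19); the critical value is passed in by the caller, and the function name is illustrative.

```python
import math

def t_interval(xs, crit):
    """Interval mean -/+ crit * S / sqrt(n); crit plays the role of t_{n-1, alpha/2}."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance S^2
    half = crit * math.sqrt(s2 / n)
    return mean - half, mean + half

def transformed_interval(xs, crit, r, c):
    """Interval computed from g_{r,c}(x) = (r*x_1 + c, ..., r*x_n + c), r > 0."""
    return t_interval([r * x + c for x in xs], crit)
```

Invariance means that `transformed_interval(xs, crit, r, c)` has endpoints `r*lo + c` and `r*hi + c`, where `(lo, hi) = t_interval(xs, crit)`; that is, $\mu$ lies in $C(x)$ exactly when $r\mu + c$ lies in $C(g_{r,c}(x))$.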


If a confidence set $C(X)$ is UMA and invariant, then it is UMAI. If
$C(X)$ is UMAU and invariant, it is not so obvious whether it is UMAI,
since a UMAI confidence set (if it exists) is not necessarily unbiased. The
following result may be used to construct a UMAI confidence set.


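Theorem 7.7. Suppose that for each $\theta_0 \in \Theta$, $A(\theta_0)$ is the acceptance
region of a nonrandomized UMPI test of size $\alpha$ for $H_0: \theta = \theta_0$ versus $H_1:
\theta \in \Theta_{\theta_0}$ under $\mathcal{G}_{\theta_0}$ and that for any $\theta_0$ and $g \in \mathcal{G}_{\theta_0}$, $\bar g$, the transformation
on $\Theta$ induced by $g$, is well defined. If $C(X) = \{\theta : x \in A(\theta)\}$ is invariant
under $\mathcal{G}$, the smallest group containing $\cup_{\theta \in \Theta}\mathcal{G}_\theta$, then it is $\Theta'$-UMAI with
confidence coefficient $1-\alpha$, where $\Theta' = \{\theta' : \theta \in \Theta_{\theta'}\}$.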


The proofs of Theorem 7.7 and the following result are given as exercises. 


Proposition 7.3. Let $\mathcal{P}$ be a parametric family indexed by $\theta$ and $\mathcal{G}$ be
a group of transformations such that $\bar g$ is well defined by $P_{\bar g(\theta)} = P_{g(X)}$.
Suppose that, for any $\theta, \theta' \in \Theta$, there is a $g \in \mathcal{G}$ such that $\bar g(\theta) = \theta'$. Then,
for any invariant confidence set $C(X)$, $P(\theta \in C(X))$ is a constant.


Example 7.18. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with unknown $\mu$
and $\sigma^2$. Consider the problem of setting a lower confidence bound for
$\theta = \mu/\sigma$ and $\mathcal{G} = \{g_r : r > 0\}$ with $g_r(x) = rx$. From Example 6.17, a
nonrandomized UMPI test of size $\alpha$ for $H_0: \theta = \theta_0$ versus $H_1: \theta > \theta_0$ has
the acceptance region $A(\theta_0) = \{x : t(x) \le c(\theta_0)\}$, where $t(X) = \sqrt{n}\bar X/S$
and $c(\theta)$ is the $(1-\alpha)$th quantile of the noncentral t-distribution $t_{n-1}(\sqrt{n}\theta)$.
Applying Theorem 7.7 with $\mathcal{G}_{\theta_0} = \mathcal{G}$ for all $\theta_0$, one can show (exercise) that
the solution of $\int_{t(x)}^\infty f_\theta(u)\,du = \alpha$ is a $\Theta'$-UMAI lower confidence bound for
$\theta$ with confidence coefficient $1-\alpha$, where $f_\theta$ is the Lebesgue p.d.f. of the
noncentral t-distribution $t_{n-1}(\sqrt{n}\theta)$ and $\Theta' = (-\infty, \theta)$.


Example 7.19. Consider again the confidence intervals in Example 7.14.
In Example 7.17, confidence interval (7.12) is shown to be invariant under
$\mathcal{G} = \{g_{r,c} : r > 0, c \in R\}$ with $g_{r,c}(x) = (rx_1 + c, ..., rx_n + c)$. Although
confidence interval (7.12) is UMAU, it is not obvious whether it is UMAI. This
interval can be obtained by inverting $A(\mu_0) = \{x : |\bar X - \mu_0| \le t_{n-1,\alpha/2}S/\sqrt{n}\}$,
which is the acceptance region of a nonrandomized test UMP among unbiased
and invariant tests of size $\alpha$ for $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$, under
$\mathcal{G}_{\mu_0} = \{h_{r,\mu_0} : r > 0\}$ with $h_{r,\mu_0}(x) = (r(x_1 - \mu_0) + \mu_0, ..., r(x_n - \mu_0) + \mu_0)$
(exercise). Note that the testing problem $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ is
not invariant under $\mathcal{G}$. Since $\mathcal{G}$ is the smallest group containing $\cup_{\mu_0 \in R}\mathcal{G}_{\mu_0}$
(exercise), by Theorem 7.7, interval (7.12) is UMA among unbiased and
invariant confidence intervals with confidence coefficient $1-\alpha$, under $\mathcal{G}$.

Using similar arguments one can show (exercise) that confidence intervals
(7.13) and (7.14) are UMA among unbiased and invariant confidence
intervals with confidence coefficient $1-\alpha$, under $\mathcal{G}_1$ (in Example 7.17) and
$\mathcal{G}$, respectively.


When UMPI tests are randomized, one can construct randomized UMAI 
confidence sets, using the techniques introduced in Theorem 7.7 and §7.2.3. 




7.3 Asymptotic Confidence Sets 


In some problems, especially in nonparametric problems, it is difficult to
find a reasonable confidence set with a given confidence coefficient or
significance level $1-\alpha$. A common approach is to find a confidence set whose
confidence coefficient or significance level is nearly $1-\alpha$ when the sample
size $n$ is large. A confidence set $C(X)$ for $\theta$ has asymptotic significance level
$1-\alpha$ if $\liminf_n P(\theta \in C(X)) \ge 1-\alpha$ for any $P \in \mathcal{P}$ (Definition 2.14). If
$\lim_{n\to\infty} P(\theta \in C(X)) = 1-\alpha$ for any $P \in \mathcal{P}$, then $C(X)$ is a $1-\alpha$
asymptotically correct confidence set. Note that asymptotic correctness is not the
same as having limiting confidence coefficient $1-\alpha$ (Definition 2.14).


7.3.1 Asymptotically pivotal quantities 


A known Borel function of $(X, \theta)$, $R_n(X, \theta)$, is said to be asymptotically
pivotal if and only if the limiting distribution of $R_n(X, \theta)$ does not depend
on $P$. Like a pivotal quantity in constructing confidence sets (§7.1.1) with
a given confidence coefficient or significance level, an asymptotically pivotal
quantity can be used in constructing asymptotically correct confidence sets.


Most asymptotically pivotal quantities are of the form $\hat V_n^{-1/2}(\hat\theta_n - \theta)$,
where $\hat\theta_n$ is an estimator of $\theta$ that is asymptotically normal, i.e.,

$$
V_n^{-1/2}(\hat\theta_n - \theta) \to_d N_k(0, I_k), \tag{7.18}
$$

and $\hat V_n$ is an estimator of the asymptotic covariance matrix $V_n$ and is
consistent according to Definition 5.4. The resulting $1-\alpha$ asymptotically correct
confidence sets are of the form

$$
C(X) = \{\theta : \|\hat V_n^{-1/2}(\hat\theta_n - \theta)\|^2 \le \chi^2_{k,\alpha}\}, \tag{7.19}
$$

where $\chi^2_{k,\alpha}$ is the $(1-\alpha)$th quantile of the chi-square distribution $\chi^2_k$. If $\theta$
is real-valued ($k = 1$), then $C(X)$ in (7.19) is a $1-\alpha$ asymptotically correct
confidence interval. When $k > 1$, $C(X)$ in (7.19) is an ellipsoid.
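For $k = 1$, the set (7.19) reduces to an interval with half-width $z_{1-\alpha/2}\hat V_n^{1/2}$, because $\chi^2_{1,\alpha} = z^2_{1-\alpha/2}$. The sketch below illustrates this for the simplest case, which is an assumption made here for concreteness: $\theta$ is a population mean, $\hat\theta_n$ is the sample mean, and $\hat V_n$ is the sample variance divided by $n$. The function name is illustrative.

```python
import math
from statistics import NormalDist, mean, variance

def asymptotic_mean_interval(xs, alpha):
    """Interval {theta : (theta_hat - theta)^2 / v_hat <= chi-square_{1,alpha}}."""
    n = len(xs)
    theta_hat = mean(xs)
    v_hat = variance(xs) / n                   # estimated variance of the sample mean
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z**2 is the (1 - alpha)th chi-square_1 quantile
    half = z * math.sqrt(v_hat)
    return theta_hat - half, theta_hat + half
```

No normality of the data is assumed; the interval is only asymptotically correct, so its coverage for small `n` can differ noticeably from $1-\alpha$.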


Example 7.20 (Functions of means). Suppose that $X_1,...,X_n$ are i.i.d.
random vectors having a c.d.f. $F$ on $R^d$ and that the unknown parameter
of interest is $\theta = g(\mu)$, where $\mu = E(X_1)$ and $g$ is a known differentiable
function from $R^d$ to $R^k$, $k \le d$. From the CLT, Theorem 1.12, and the
result in §5.5.1, (7.18) holds with $\hat\theta_n = g(\bar X)$ and $\hat V_n$ given by (5.108). Thus,
$C(X)$ in (7.19) is a $1-\alpha$ asymptotically correct confidence set for $\theta$.
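A concrete instance of Example 7.20 with $d = 2$ and $k = 1$ is the ratio of two means, $g(\mu) = \mu_1/\mu_2$. The sketch below assumes that the variance estimator takes the delta-method form $\hat V_n = [\nabla g(\bar X)]^\tau S^2 \nabla g(\bar X)/n$ with $S^2$ the sample covariance matrix; formula (5.108) is not reproduced in this section, so this form is an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def ratio_interval(pairs, alpha):
    """Asymptotic interval for mu_1 / mu_2 from i.i.d. pairs (x, y)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pairs) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    theta_hat = mx / my
    gx, gy = 1 / my, -mx / my ** 2             # gradient of g(a, b) = a / b at the sample mean
    v_hat = (gx * gx * sxx + 2 * gx * gy * sxy + gy * gy * syy) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(v_hat)
    return theta_hat - half, theta_hat + half
```

The construction requires the second sample mean to be away from 0, since $g$ is not differentiable where $\mu_2 = 0$.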


Example 7.21 (Statistical functionals). Suppose that $X_1,...,X_n$ are i.i.d.
random vectors having a c.d.f. $F$ on $R^d$ and that the unknown parameter
of interest is $\theta = T(F)$, where $T$ is a k-vector-valued functional. Let $F_n$ be
the empirical c.d.f. defined by (5.1) and $\hat\theta_n = T(F_n)$. Suppose that each
component of $T$ is $\varrho_\infty$-Hadamard differentiable with an influence function
satisfying (5.40) and that the conditions in Theorem 5.15 hold. Then, by
Theorems 5.5 and 5.15 and the discussions in §5.2.1, (7.18) holds with
$\hat V_n$ given by (5.110) and $C(X)$ in (7.19) is a $1-\alpha$ asymptotically correct
confidence set for $\theta$.


Example 7.22 (Linear models). Consider linear model (3.25): $X = Z\beta + \varepsilon$,
where $\varepsilon$ has i.i.d. components with mean 0 and variance $\sigma^2$. Assume that
$Z$ is of full rank and that the conditions in Theorem 3.12 hold. It follows
from Theorem 1.9(iii) and Theorem 3.12 that (7.18) holds with $\hat\theta_n = \hat\beta$ and
$\hat V_n = (n-p)^{-1}SSR(Z^\tau Z)^{-1}$ (see §5.5.1). Thus, a $1-\alpha$ asymptotically
correct confidence set for $\beta$ is

$$
C(X) = \{\beta : (\hat\beta - \beta)^\tau(Z^\tau Z)(\hat\beta - \beta) \le \chi^2_{p,\alpha}SSR/(n-p)\}.
$$

Note that this confidence set is different from the one in Example 7.9 derived
under the normality assumption on $\varepsilon$.


The problems in the previous three examples are nonparametric. The
method of using asymptotically pivotal quantities can also be applied to
parametric problems. Note that in a parametric problem where the unknown
parameter $\theta$ is multivariate, a confidence set for $\theta$ with a given
confidence coefficient may be difficult or impossible to obtain.

Typically, in a given problem there exist many different asymptotically
pivotal quantities that lead to different $1-\alpha$ asymptotically correct
confidence sets for $\theta$. Intuitively, if two asymptotic confidence sets are
constructed using (7.18) with two different estimators, $\hat\theta_{1n}$ and $\hat\theta_{2n}$, and if $\hat\theta_{1n}$
is asymptotically more efficient than $\hat\theta_{2n}$ (§4.5.1), then the confidence set
based on $\hat\theta_{1n}$ should be better than the one based on $\hat\theta_{2n}$ in some sense.
This is formally stated in the following result.


Proposition 7.4. Let $C_j(X)$, $j = 1, 2$, be the confidence sets given in
(7.19) with $\hat\theta_n = \hat\theta_{jn}$ and $\hat V_n = \hat V_{jn}$, $j = 1, 2$, respectively. Suppose that for
each $j$, (7.18) holds for $\hat\theta_{jn}$ and $\hat V_{jn}$ is consistent for $V_{jn}$, the asymptotic
covariance matrix of $\hat\theta_{jn}$. If $\mathrm{Det}(V_{1n}) < \mathrm{Det}(V_{2n})$ for sufficiently large $n$,
where $\mathrm{Det}(A)$ is the determinant of $A$, then

$$
P(\mathrm{vol}(C_1(X)) < \mathrm{vol}(C_2(X))) \to 1.
$$

Proof. The result follows from the consistency of $\hat V_{jn}$ and the fact that
the volume of the ellipsoid $C(X)$ defined by (7.19) is equal to

$$
\mathrm{vol}(C(X)) = \frac{\pi^{k/2}(\chi^2_{k,\alpha})^{k/2}[\mathrm{Det}(\hat V_n)]^{1/2}}{\Gamma(1 + k/2)}.
$$
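The volume formula in the proof of Proposition 7.4 is easy to evaluate directly. The sketch below takes the chi-square quantile and the determinant of the estimated covariance matrix as inputs; the function names are illustrative. For $k = 2$ it uses the standard fact that the chi-square distribution with 2 degrees of freedom is exponential with mean 2, so its $(1-\alpha)$th quantile is $-2\log\alpha$.

```python
import math

def ellipsoid_volume(k, chi2_quantile, det_v):
    """Volume of {theta : (theta_hat - theta)' V^{-1} (theta_hat - theta) <= chi2_quantile} in R^k."""
    return (math.pi ** (k / 2) * chi2_quantile ** (k / 2) * math.sqrt(det_v)
            / math.gamma(1 + k / 2))

def chi2_quantile_2df(alpha):
    """(1 - alpha)th quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(alpha)
```

For fixed `k` and `alpha` the volume is proportional to the square root of the determinant, which is why comparing determinants of the asymptotic covariance matrices is enough to compare volumes.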




If $\hat\theta_{1n}$ is asymptotically more efficient than $\hat\theta_{2n}$ (§4.5.1), then $\mathrm{Det}(V_{1n}) \le
\mathrm{Det}(V_{2n})$. Hence, Proposition 7.4 indicates that a more efficient estimator
of $\theta$ results in a better confidence set of the form (7.19) in terms of volume.
If $\hat\theta_n$ is asymptotically efficient (optimal in the sense of having the smallest
asymptotic covariance matrix; see Definition 4.4), then the confidence set
$C(X)$ in (7.19) is asymptotically optimal (in terms of volume) among the
confidence sets of the form (7.19).

Asymptotically correct confidence sets for $\theta$ can also be constructed by
inverting acceptance regions of asymptotic tests for testing $H_0: \theta = \theta_0$
versus some $H_1$. If asymptotic tests are constructed using asymptotically
pivotal quantities (see §6.4.2, §6.4.3, and §6.5.4), the resulting confidence
sets are almost the same as those based on asymptotically pivotal quantities.


7.3.2 Confidence sets based on likelihoods 


As we discussed in §7.3.1, a 1 — a asymptotically correct confidence set is 
asymptotically optimal in some sense if it is based on an asymptotically 
efficient point estimator. In parametric problems, it is shown in §4.5 that 
MLE’s or RLE’s are asymptotically efficient. Thus, in this section we study 
more closely the asymptotic confidence sets based on MLE’s and RLE’s or, 
more generally, based on likelihoods. 


Consider the case where $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a parametric family
dominated by a $\sigma$-finite measure, where $\Theta \subset R^k$. For convenience, we
consider $\theta = (\vartheta, \varphi)$ and confidence sets for $\vartheta$ with dimension $r$. Let $\ell(\theta)$ be
the likelihood function based on the observation $X = x$. The acceptance
region of the LR test defined in §6.4.1 with $\Theta_0 = \{\theta : \vartheta = \vartheta_0\}$ is

$$
A(\vartheta_0) = \{x : \ell(\vartheta_0, \hat\varphi_{\vartheta_0}) \ge e^{-c_\alpha/2}\ell(\hat\theta)\},
$$

where $\ell(\hat\theta) = \sup_{\theta \in \Theta}\ell(\theta)$, $\ell(\vartheta, \hat\varphi_\vartheta) = \sup_\varphi \ell(\vartheta, \varphi)$, and $c_\alpha$ is a constant
related to the significance level $\alpha$. Under the conditions of Theorem 6.5, if
$c_\alpha$ is chosen to be $\chi^2_{r,\alpha}$, the $(1-\alpha)$th quantile of the chi-square distribution
$\chi^2_r$, then

$$
C(X) = \{\vartheta : \ell(\vartheta, \hat\varphi_\vartheta) \ge e^{-c_\alpha/2}\ell(\hat\theta)\} \tag{7.20}
$$

is a $1-\alpha$ asymptotically correct confidence set. Note that this confidence
set and the one given by (7.19) are generally different.

In many cases $-\ell(\vartheta, \varphi)$ is a convex function of $\vartheta$ and, therefore, the set
defined by (7.20) is a bounded set in $R^k$; in particular, $C(X)$ in (7.20) is a
bounded interval when $k = 1$.


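In §6.4.2 we discussed two asymptotic tests closely related to the LR
test: Wald's test and Rao's score test. When $\Theta_0 = \{\theta : \vartheta = \vartheta_0\}$, Wald's
test has acceptance region

$$
A(\vartheta_0) = \{x : (\hat\vartheta - \vartheta_0)^\tau\{C^\tau[I_n(\hat\theta)]^{-1}C\}^{-1}(\hat\vartheta - \vartheta_0) \le \chi^2_{r,\alpha}\}, \tag{7.21}
$$

where $\hat\theta = (\hat\vartheta, \hat\varphi)$ is an MLE or RLE of $\theta = (\vartheta, \varphi)$, $I_n(\theta)$ is the Fisher
information matrix based on $X$, $C^\tau = (\,I_r \;\; 0\,)$, and 0 is an $r \times (k - r)$
matrix of 0's. By Theorem 4.17 or 4.18, the confidence set obtained by
inverting $A(\vartheta)$ in (7.21) is the same as that in (7.19) with $\theta = \vartheta$ and
$\hat V_n = C^\tau[I_n(\hat\theta)]^{-1}C$.

When $\Theta_0 = \{\theta : \vartheta = \vartheta_0\}$, Rao's score test has acceptance region

$$
A(\vartheta_0) = \{x : [s_n(\vartheta_0, \hat\varphi_{\vartheta_0})]^\tau[I_n(\vartheta_0, \hat\varphi_{\vartheta_0})]^{-1}s_n(\vartheta_0, \hat\varphi_{\vartheta_0}) \le \chi^2_{r,\alpha}\}, \tag{7.22}
$$

where $s_n(\theta) = \partial\log\ell(\theta)/\partial\theta$ and $\hat\varphi_\vartheta$ is defined in (7.20). The confidence set
obtained by inverting $A(\vartheta)$ in (7.22) is also $1-\alpha$ asymptotically correct.
To illustrate these likelihood-based confidence sets and their differences, we
consider the following two examples.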

Example 7.23. Let $X_1,...,X_n$ be i.i.d. binary random variables with $p =
P(X_i = 1)$. Since confidence sets for $p$ with a given confidence coefficient
are usually randomized (§7.2.3), asymptotically correct confidence sets may
be considered when $n$ is large.

The likelihood ratio for testing $H_0: p = p_0$ versus $H_1: p \ne p_0$ is

$$
\lambda(Y) = p_0^Y(1 - p_0)^{n-Y}/[\hat p^Y(1 - \hat p)^{n-Y}],
$$

where $Y = \sum_{i=1}^n X_i$ and $\hat p = Y/n$ is the MLE of $p$. The confidence set
(7.20) is then equal to

$$
C_1(X) = \{p : p^Y(1-p)^{n-Y} \ge e^{-\chi^2_{1,\alpha}/2}\,\hat p^Y(1 - \hat p)^{n-Y}\}.
$$

When $0 < Y < n$, $-p^Y(1-p)^{n-Y}$ is strictly convex and equals 0 if $p = 0$ or
1 and, hence, $C_1(X) = [\underline{p}, \bar p]$ with $0 < \underline{p} < \bar p < 1$. When $Y = 0$, $(1-p)^n$ is
strictly decreasing and, therefore, $C_1(X) = (0, \bar p]$ with $0 < \bar p < 1$. Similarly,
when $Y = n$, $C_1(X) = [\underline{p}, 1)$ with $0 < \underline{p} < 1$.

The confidence set obtained by inverting acceptance regions of Wald's
tests is simply

$$
C_2(X) = \Big[\hat p - z_{1-\alpha/2}\sqrt{\hat p(1 - \hat p)/n},\; \hat p + z_{1-\alpha/2}\sqrt{\hat p(1 - \hat p)/n}\Big],
$$

since $I_n(p) = n/[p(1-p)]$ and $(\chi^2_{1,\alpha})^{1/2} = z_{1-\alpha/2}$, the $(1 - \alpha/2)$th quantile
of $N(0,1)$. Note that

$$
s_n(p) = \frac{Y}{p} - \frac{n - Y}{1 - p} = \frac{Y - pn}{p(1-p)}
$$

and

$$
[s_n(p)]^2[I_n(p)]^{-1} = \frac{(Y - pn)^2}{p^2(1-p)^2}\cdot\frac{p(1-p)}{n} = \frac{n(\hat p - p)^2}{p(1-p)}.
$$

Hence, the confidence set obtained by inverting acceptance regions of Rao's
score tests is

$$
C_3(X) = \{p : n(\hat p - p)^2 \le p(1-p)\chi^2_{1,\alpha}\}.
$$

It can be shown (exercise) that $C_3(X) = [p_-, p_+]$ with

$$
p_\pm = \frac{2Y + \chi^2_{1,\alpha} \pm \sqrt{\chi^2_{1,\alpha}[4n\hat p(1 - \hat p) + \chi^2_{1,\alpha}]}}{2(n + \chi^2_{1,\alpha})}.
$$

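Example 7.24. Let $X_1,...,X_n$ be i.i.d. from $N(\mu, \varphi)$ with unknown $\theta =
(\mu, \varphi)$. Consider the problem of constructing a $1-\alpha$ asymptotically correct
confidence set for $\theta$. The log-likelihood function is

$$
\log\ell(\theta) = -\frac{1}{2\varphi}\sum_{i=1}^n (X_i - \mu)^2 - \frac{n}{2}\log\varphi - \frac{n}{2}\log(2\pi).
$$

Since $(\bar X, \hat\varphi)$ is the MLE of $\theta$, where $\hat\varphi = (n-1)S^2/n$, the confidence set
based on LR tests is

$$
C_1(X) = \Big\{\theta : \frac{1}{\varphi}\sum_{i=1}^n (X_i - \mu)^2 + n\log\varphi \le \chi^2_{2,\alpha} + n + n\log\hat\varphi\Big\}.
$$

Note that

$$
s_n(\theta) = \Big(\frac{n(\bar X - \mu)}{\varphi},\; \frac{1}{2\varphi^2}\sum_{i=1}^n (X_i - \mu)^2 - \frac{n}{2\varphi}\Big)
$$

and

$$
I_n(\theta) = n\begin{pmatrix} \frac{1}{\varphi} & 0 \\ 0 & \frac{1}{2\varphi^2} \end{pmatrix}.
$$

Hence, the confidence set based on Wald's tests is

$$
C_2(X) = \Big\{\theta : \frac{(\bar X - \mu)^2}{\hat\varphi} + \frac{(\hat\varphi - \varphi)^2}{2\hat\varphi^2} \le \frac{\chi^2_{2,\alpha}}{n}\Big\},
$$

which is an ellipsoid in $R^2$, and the confidence set based on Rao's score
tests is

$$
C_3(X) = \Big\{\theta : \frac{(\bar X - \mu)^2}{\varphi} + \frac{1}{2}\Big[\frac{1}{n\varphi}\sum_{i=1}^n (X_i - \mu)^2 - 1\Big]^2 \le \frac{\chi^2_{2,\alpha}}{n}\Big\}.
$$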




Figure 7.2: Confidence sets obtained by inverting LR, Wald’s, and Rao’s 
score tests in Example 7.24 


In general, $C_j(X)$, $j = 1, 2, 3$, are different. An example of these three
confidence sets is given in Figure 7.2, where $n = 100$, $\mu = 0$, and $\varphi = 1$.

Consider now the construction of a confidence set for $\mu$. It can be shown
(exercise) that the confidence set based on Wald's tests is defined by $C_2(X)$
with $\varphi$ replaced by $\hat\varphi$, whereas the confidence sets based on LR tests and
Rao's score tests are defined by $C_1(X)$ and $C_3(X)$, respectively, with $\varphi$
replaced by $n^{-1}\sum_{i=1}^n (X_i - \mu)^2$.


In nonparametric problems, asymptotic confidence sets can be obtained
by inverting acceptance regions of empirical likelihood ratio tests or profile
empirical likelihood ratio tests (§6.5.3). We consider the following problem
as an example. Let $X_1,...,X_n$ be i.i.d. from $F$ and $\theta = (\vartheta, \varphi)$ be a k-vector
of unknown parameters defined by $E[\psi(X_1, \theta)] = 0$, where $\psi$ is a known
function. Using the empirical likelihood

$$
\ell(G) = \prod_{i=1}^n p_i \quad\text{subject to}\quad p_i \ge 0,\;\; \sum_{i=1}^n p_i = 1,\;\; \sum_{i=1}^n p_i\psi(x_i, \theta) = 0,
$$

we can obtain a confidence set for $\vartheta$ by inverting acceptance regions of the
profile empirical likelihood ratio tests based on the ratio $\lambda_n(X)$ in (6.91).
This leads to the confidence set defined by

$$
C(X) = \Big\{\vartheta : \prod_{i=1}^n \frac{1 + [\xi_n(\hat\theta)]^\tau\psi(x_i, \hat\theta)}{1 + [\zeta_n(\vartheta, \hat\varphi)]^\tau\psi(x_i, \vartheta, \hat\varphi)} \ge e^{-\chi^2_{r,\alpha}/2}\Big\}, \tag{7.23}
$$

where the notation is the same as that in (6.91) and $\chi^2_{r,\alpha}$ is the $(1-\alpha)$th
quantile of the chi-square distribution $\chi^2_r$ with $r$ = the dimension of $\vartheta$. By
Theorem 6.11, this confidence set is $1-\alpha$ asymptotically correct. Inverting
the function of $\vartheta$ in (7.23) may be complicated, but $C(X)$ can usually
be obtained numerically. More discussions about confidence sets based on
empirical likelihoods can be found in Owen (1988, 1990, 2001), Chen and
Qin (1993), Qin (1993), and Qin and Lawless (1994).


7.3.3 Confidence intervals for quantiles 


Let $X_1,...,X_n$ be i.i.d. from a continuous c.d.f. $F$ on $R$ and let $\theta = F^{-1}(p)$
be the $p$th quantile of $F$, $0 < p < 1$. The general methods we previously
discussed can be applied to obtain a confidence set for $\theta$, but we introduce
here a method that works especially for quantile problems.

In fact, for any given $\alpha$, it is possible to derive a confidence interval
(or bound) for $\theta$ with confidence coefficient $1-\alpha$ (Exercise 84), but the
numerical computation of such a confidence interval may be cumbersome.
We focus on asymptotic confidence intervals for $\theta$. Our result is based on
the following result due to Bahadur (1966). Its proof is omitted.


Theorem 7.8. Let $X_1,...,X_n$ be i.i.d. from a continuous c.d.f. $F$ on $R$
that is twice differentiable at $\theta = F^{-1}(p)$, $0 < p < 1$, with $F'(\theta) > 0$.
Let $\{k_n\}$ be a sequence of integers satisfying $1 \le k_n \le n$ and $k_n/n =
p + o((\log n)^\delta/\sqrt{n})$ for some $\delta > 0$. Let $F_n$ be the empirical c.d.f. defined
in (5.1). Then

$$
X_{(k_n)} = \theta + \frac{(k_n/n) - F_n(\theta)}{F'(\theta)} + O\Big(\frac{(\log n)^{(1+\delta)/2}}{n^{3/4}}\Big) \quad\text{a.s.}
$$

The result in Theorem 7.8 is a refinement of the Bahadur representation
in Theorem 5.11. The following corollary of Theorem 7.8 is useful in
statistics. Let $\hat\theta_n = F_n^{-1}(p)$ be the sample $p$th quantile.

Corollary 7.1. Assume the conditions in Theorem 7.8 and $k_n/n = p +
cn^{-1/2} + o(n^{-1/2})$ with a constant $c$. Then

$$
\sqrt{n}(X_{(k_n)} - \hat\theta_n) \to_{a.s.} c/F'(\theta).
$$




The proof of Corollary 7.1 is left as an exercise. Using Corollary 7.1, we
can obtain a confidence interval for $\theta$ with limiting confidence coefficient
$1-\alpha$ (Definition 2.14) for any given $\alpha \in (0, \frac{1}{2})$.


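Corollary 7.2. Assume the conditions in Theorem 7.8. Let $\{k_{1n}\}$ and
$\{k_{2n}\}$ be two sequences of integers satisfying $1 \le k_{1n} < k_{2n} \le n$,

$$
k_{1n}/n = p - z_{1-\alpha/2}\sqrt{p(1-p)/n} + o(n^{-1/2}),
$$

and

$$
k_{2n}/n = p + z_{1-\alpha/2}\sqrt{p(1-p)/n} + o(n^{-1/2}),
$$

where $z_a = \Phi^{-1}(a)$. Then the confidence interval $C(X) = [X_{(k_{1n})}, X_{(k_{2n})}]$
has the property that $P(\theta \in C(X))$ does not depend on $P$ and

$$
\lim_{n\to\infty}\inf_{P \in \mathcal{P}} P(\theta \in C(X)) = \lim_{n\to\infty} P(\theta \in C(X)) = 1 - \alpha. \tag{7.24}
$$

Furthermore,

$$
\text{the length of } C(X) = \frac{2z_{1-\alpha/2}\sqrt{p(1-p)}}{F'(\theta)\sqrt{n}} + o\Big(\frac{1}{\sqrt{n}}\Big) \quad\text{a.s.}
$$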

Proof. Note that $P(X_{(k_{1n})} \le \theta \le X_{(k_{2n})}) = P(U_{(k_{1n})} \le p \le U_{(k_{2n})})$,
where $U_{(k)}$ is the $k$th order statistic based on a sample $U_1,...,U_n$ i.i.d. from
the uniform distribution $U(0,1)$ (Exercise 84). Hence, $P(\theta \in C(X))$ does
not depend on $P$ and the first equality in (7.24) holds.

By Corollary 7.1, Theorem 5.10, and Slutsky's theorem,

$$
\begin{aligned}
P(X_{(k_{1n})} > \theta) &= P\Big(\hat\theta_n - z_{1-\alpha/2}\frac{\sqrt{p(1-p)}}{F'(\theta)\sqrt{n}} + o_p(n^{-1/2}) > \theta\Big) \\
&= P\Big(\frac{\sqrt{n}(\hat\theta_n - \theta)}{\sqrt{p(1-p)}/F'(\theta)} + o_p(1) > z_{1-\alpha/2}\Big) \\
&\to 1 - \Phi(z_{1-\alpha/2}) \\
&= \alpha/2.
\end{aligned}
$$

Similarly, $P(X_{(k_{2n})} < \theta) \to \alpha/2$. Hence the second equality in (7.24) holds.
The result for the length of $C(X)$ follows directly from Corollary 7.1.


The confidence interval $[X_{(k_{1n})}, X_{(k_{2n})}]$ given in Corollary 7.2 is called
Woodruff's (1952) interval. It has limiting confidence coefficient $1-\alpha$, a
property that is stronger than the $1-\alpha$ asymptotic correctness.
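Woodruff's interval only needs two order statistics. The sketch below is one way to pick the indices: it rounds $np \mp z_{1-\alpha/2}\sqrt{np(1-p)}$ outward and clips to the range 1 to n. The rounding rule is an assumption made here; any choice that differs by $o(\sqrt{n})$ satisfies the conditions of Corollary 7.2. The function name is illustrative.

```python
import math
from statistics import NormalDist

def woodruff_interval(xs, p, alpha):
    """Interval [X_(k1), X_(k2)] for the p-th quantile, using 1-based order statistics."""
    n = len(xs)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(n * p * (1 - p))
    k1 = max(1, math.floor(n * p - half))     # lower index, rounded down
    k2 = min(n, math.ceil(n * p + half))      # upper index, rounded up
    ordered = sorted(xs)
    return ordered[k1 - 1], ordered[k2 - 1]
```

For example, with the data 1, 2, ..., 100, `p = 0.5`, and `alpha = 0.05`, the indices are 40 and 60, so the interval for the median is [40, 60]. No estimate of the density $F'(\theta)$ is required.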


From Theorem 5.10, if $F'(\theta)$ exists and is positive, then

$$
\sqrt{n}(\hat\theta_n - \theta) \to_d N\Big(0, \frac{p(1-p)}{[F'(\theta)]^2}\Big).
$$

If the derivative $F'(\theta)$ has a consistent estimator $\hat d_n$ obtained using some
method such as one of those introduced in §5.1.3, then (7.18) holds with
$\hat V_n = p(1-p)/(n\hat d_n^2)$ and the method introduced in §7.3.1 can be applied to
derive the following $1-\alpha$ asymptotically correct confidence interval:

$$
C_1(X) = \Big[\hat\theta_n - z_{1-\alpha/2}\frac{\sqrt{p(1-p)}}{\hat d_n\sqrt{n}},\; \hat\theta_n + z_{1-\alpha/2}\frac{\sqrt{p(1-p)}}{\hat d_n\sqrt{n}}\Big].
$$

The length of $C_1(X)$ is asymptotically almost the same as that of Woodruff's
interval. However, $C_1(X)$ depends on the estimated derivative $\hat d_n$, and it is
usually difficult to obtain a precise estimator $\hat d_n$.


7.3.4 Accuracy of asymptotic confidence sets 


In §7.3.1 (Proposition 7.4) we evaluate a $1-\alpha$ asymptotically correct
confidence set $C(X)$ by its volume. We now study another way of assessing
$C(X)$ by considering the convergence speed of $P(\theta \in C(X))$.

Definition 7.5. A $1-\alpha$ asymptotically correct confidence set $C(X)$ for $\theta$
is said to be $l$th-order (asymptotically) accurate if and only if

$$
P(\theta \in C(X)) = 1 - \alpha + O(n^{-l/2}),
$$

where $l$ is a positive integer.


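We focus on the case where $\theta$ is real-valued. For $\theta \in R$, the confidence
set given by (7.19) is the two-sided interval $[\underline\theta_{\alpha/2}, \bar\theta_{\alpha/2}]$, where $\underline\theta_\alpha = \hat\theta_n -
z_{1-\alpha}\hat V_n^{1/2}$, $\bar\theta_\alpha = \hat\theta_n + z_{1-\alpha}\hat V_n^{1/2}$, and $z_a = \Phi^{-1}(a)$. Suppose that the c.d.f. of
$\hat V_n^{-1/2}(\hat\theta_n - \theta)$ admits the Edgeworth expansion (1.106) with $m = 1$. Then

$$
\begin{aligned}
P(\theta \ge \underline\theta_\alpha) &= P\big(\hat V_n^{-1/2}(\hat\theta_n - \theta) \le z_{1-\alpha}\big) \\
&= \Phi(z_{1-\alpha}) + n^{-1/2}p_1(z_{1-\alpha})\Phi'(z_{1-\alpha}) + o(n^{-1/2}) \\
&= 1 - \alpha + O(n^{-1/2}),
\end{aligned}
$$

i.e., the lower confidence bound $\underline\theta_\alpha$ or the one-sided confidence interval
$[\underline\theta_\alpha, \infty)$ is first-order accurate. Similarly,

$$
\begin{aligned}
P(\theta \le \bar\theta_\alpha) &= 1 - P\big(\hat V_n^{-1/2}(\hat\theta_n - \theta) < -z_{1-\alpha}\big) \\
&= 1 - \Phi(-z_{1-\alpha}) - n^{-1/2}p_1(-z_{1-\alpha})\Phi'(-z_{1-\alpha}) + o(n^{-1/2}) \\
&= 1 - \alpha + O(n^{-1/2}),
\end{aligned}
$$

i.e., the upper confidence bound $\bar\theta_\alpha$ is first-order accurate. Combining these
results and using the fact that $\Phi'(x)$ and $p_1(x)$ are even functions (Theorem
1.16), we obtain that

$$
P(\underline\theta_{\alpha/2} \le \theta \le \bar\theta_{\alpha/2}) = 1 - \alpha + o(n^{-1/2}),
$$

which indicates that the coverage probability of $[\underline\theta_{\alpha/2}, \bar\theta_{\alpha/2}]$ converges to
$1-\alpha$ at a rate faster than $n^{-1/2}$. In fact, if we assume that the c.d.f. of
$\hat V_n^{-1/2}(\hat\theta_n - \theta)$ admits the Edgeworth expansion (1.106) with $m = 2$, then

$$
P(\underline\theta_{\alpha/2} \le \theta \le \bar\theta_{\alpha/2}) = 1 - \alpha + 2n^{-1}p_2(z_{1-\alpha/2})\Phi'(z_{1-\alpha/2}) + o(n^{-1}), \tag{7.25}
$$

i.e., the equal-tail two-sided confidence interval $[\underline\theta_{\alpha/2}, \bar\theta_{\alpha/2}]$ is second-order
accurate.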
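The cancellation behind the two-sided result can be written out in one display. The following is a sketch of the intermediate step, with the shorthand $W_n = \hat V_n^{-1/2}(\hat\theta_n - \theta)$ and $z = z_{1-\alpha/2}$ introduced here for brevity; it uses that $p_1$ and $\Phi'$ are even, and, for the $n^{-1}$ term in (7.25), that $p_2$ is odd.

```latex
\begin{aligned}
P(\underline\theta_{\alpha/2} \le \theta \le \bar\theta_{\alpha/2})
  &= P(W_n \le z) - P(W_n < -z) \\
  &= \Phi(z) - \Phi(-z)
     + n^{-1/2}\,[\,p_1(z)\Phi'(z) - p_1(-z)\Phi'(-z)\,] \\
  &\qquad + n^{-1}\,[\,p_2(z)\Phi'(z) - p_2(-z)\Phi'(-z)\,] + o(n^{-1}) \\
  &= (1 - \alpha) + 0 + 2n^{-1}p_2(z)\Phi'(z) + o(n^{-1}).
\end{aligned}
```

The $n^{-1/2}$ terms of the two one-sided errors are equal and cancel in the difference, whereas the $n^{-1}$ terms have opposite signs and add.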

Can we obtain a confidence bound that is more accurate than $\underline\theta_\alpha$ (or $\bar\theta_\alpha$)
or a two-sided confidence interval that is more accurate than $[\underline\theta_{\alpha/2}, \bar\theta_{\alpha/2}]$?
The answer is affirmative if a higher order Edgeworth expansion is available.
Assume that the conditions in Theorem 1.16 are satisfied for an integer
$m \ge 2$. Using the arguments in deriving the polynomial $q_j$ in (1.108), we
can show (exercise) that

$$
P\Big(\hat V_n^{-1/2}(\hat\theta_n - \theta) \le z_{1-\alpha} + \sum_{j=1}^{m-1}\frac{q_j(z_{1-\alpha})}{n^{j/2}}\Big) = 1 - \alpha + O(n^{-m/2}). \tag{7.26}
$$

If the coefficients of polynomials $q_1,...,q_{m-1}$ are known, then

$$
\hat\theta_n - \hat V_n^{1/2}\Big[z_{1-\alpha} + \sum_{j=1}^{m-1}\frac{q_j(z_{1-\alpha})}{n^{j/2}}\Big]
$$

is an $m$th-order accurate lower confidence bound for $\theta$. In general, however,
some coefficients of $q_j$'s are unknown. Let $\hat q_j$ be the same as $q_j$ with all
unknown coefficients in the polynomial $q_j$ replaced by their estimators, $j =
1,...,m-1$. Assume that $\hat q_1(z_{1-\alpha}) - q_1(z_{1-\alpha}) = O_p(n^{-1/2})$, i.e., $\hat q_1(z_{1-\alpha})$
is $\sqrt{n}$-consistent. Then, the lower confidence bound

$$
\underline\theta_\alpha^{(2)} = \hat\theta_n - \hat V_n^{1/2}[z_{1-\alpha} + n^{-1/2}\hat q_1(z_{1-\alpha})]
$$

is second-order accurate (Hall, 1992). A second-order accurate upper
confidence bound $\bar\theta_\alpha^{(2)}$ can be similarly defined. However, the two-sided
confidence interval $[\underline\theta_{\alpha/2}^{(2)}, \bar\theta_{\alpha/2}^{(2)}]$ is only second-order accurate, i.e., in terms of the
convergence speed, it does not improve the confidence interval $[\underline\theta_{\alpha/2}, \bar\theta_{\alpha/2}]$.

Higher order accurate confidence bounds and two-sided confidence
intervals can be obtained using Edgeworth and Cornish-Fisher expansions.
See Hall (1992).




Example 7.25. Let $X_1,...,X_n$ be i.i.d. random d-vectors. Consider the
problem of setting a lower confidence bound for $\theta = g(\mu)$, where $\mu = EX_1$
and $g$ is five times continuously differentiable from $R^d$ to $R$ with $\nabla g(\mu) \ne 0$.
Let $\bar X$ be the sample mean and $\hat\theta_n = g(\bar X)$. Then, $\hat V_n$ in (7.19) can be
chosen to be $n^{-2}(n-1)[\nabla g(\bar X)]^\tau S^2\nabla g(\bar X)$, where $S^2$ is the sample covariance
matrix. Let $X_{ij}$ be the $j$th component of $X_i$ and $Y_i$ be a vector containing
components of the form $X_{ij}$ and $X_{ij}X_{ij'}$, $j = 1,...,d$, $j' \ge j$. It can be
shown (exercise) that $\hat V_n^{-1/2}(\hat\theta_n - \theta)$ can be written as $\sqrt{n}h(\bar Y)/\sigma_h$, where
$\bar Y$ is the sample mean of $Y_i$'s, $h$ is a five times continuously differentiable
function, $h(EY_1) = 0$, and $\sigma_h^2 = [\nabla h(EY_1)]^\tau\mathrm{Var}(Y_1)\nabla h(EY_1)$. Assume
that $E\|Y_1\|^4 < \infty$ and that Cramér's continuity condition (1.105) is
satisfied. By Theorem 1.16, the distribution of $\hat V_n^{-1/2}(\hat\theta_n - \theta)$ admits the
Edgeworth expansion (1.106) with $m = 2$ and $p_1(x)$ given by (1.107). Since
$q_1(x) = -p_1(x)$ (§1.5.6), we obtain the following second-order accurate
lower confidence bound for $\theta$:

$$
\underline\theta_\alpha^{(2)} = \hat\theta_n - \hat V_n^{1/2}\{z_{1-\alpha} + n^{-1/2}[\hat c_1\hat\sigma_h^{-1} - 6^{-1}\hat c_2\hat\sigma_h^{-3}(z_{1-\alpha}^2 - 1)]\},
$$

where $\hat\sigma_h^2 = [\nabla h(\bar Y)]^\tau S_Y^2\nabla h(\bar Y)$, $S_Y^2$ is the sample covariance matrix based
on $Y_i$'s, and $\hat c_j$ is the estimator of $c_j$ in (1.107) obtained by replacing the
moments of $Y_1$ with the corresponding sample moments.

In particular, if $d = 1$ and $g(x) = x$, then it follows from Example 1.34
that

$$
\underline\theta_\alpha^{(2)} = \bar X - n^{-1/2}\hat\sigma[z_{1-\alpha} - 6^{-1}n^{-1/2}\hat\kappa_3(2z_{1-\alpha}^2 + 1)],
$$

where $\hat\sigma^2 = n^{-1}\sum_{i=1}^n (X_i - \bar X)^2$ and $\hat\kappa_3 = n^{-1}\sum_{i=1}^n (X_i - \bar X)^3/\hat\sigma^3$ is
$\sqrt{n}$-consistent for $\kappa_3 = E(X_1 - \mu)^3/\sigma^3$ with $\sigma^2 = \mathrm{Var}(X_1)$. A second-order
accurate lower confidence bound for $\sigma^2$ can be similarly derived (Exercise
89).
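For the special case $d = 1$ and $g(x) = x$, the first-order and second-order lower bounds for a mean can be computed side by side. The sketch below follows the formula for $\underline\theta_\alpha^{(2)}$ stated at the end of Example 7.25, with the divisor-$n$ estimators $\hat\sigma$ and $\hat\kappa_3$; the function name is illustrative.

```python
import math
from statistics import NormalDist

def lower_bounds_for_mean(xs, alpha):
    """Return (first-order bound, second-order bound) for the mean."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    kappa3_hat = sum((x - xbar) ** 3 for x in xs) / (n * sigma_hat ** 3)  # sample skewness
    z = NormalDist().inv_cdf(1 - alpha)
    first = xbar - sigma_hat * z / math.sqrt(n)
    # Edgeworth (skewness) correction of the normal quantile.
    corrected_z = z - kappa3_hat * (2 * z * z + 1) / (6 * math.sqrt(n))
    second = xbar - sigma_hat * corrected_z / math.sqrt(n)
    return first, second
```

When the sample skewness is zero the two bounds coincide; for right-skewed data the correction moves the lower bound upward.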


7.4 Bootstrap Confidence Sets 


In this section, we study how to use the bootstrap method introduced in 
§5.5.3 to construct asymptotically correct confidence sets. There are two 
main advantages of using the bootstrap method. First, as we can see from 
previous sections, constructing confidence sets having a given confidence 
coefficient or being asymptotically correct requires some theoretical deriva- 
tions. The bootstrap method replaces these derivations by some routine 
computations. Second, confidence intervals (especially one-sided confidence 
intervals) constructed using the bootstrap method may be asymptotically 
second-order accurate (§7.3.4). 


We use the notation in §5.5.3. Let $X = (X_1,...,X_n)$ be a sample from $P$.
We focus on the case where $X_i$'s are i.i.d. so that $P$ is specified by a c.d.f. $F$
on $R^d$, although some results discussed here can be extended to non-i.i.d.
cases. Also, we assume that $F$ is estimated by the empirical c.d.f. $F_n$ defined
in (5.1) (which means that no assumption is imposed on $F$ and the problem
is nonparametric) so that $\hat P$ in §5.5.3 is the population corresponding to $F_n$.
A bootstrap sample $X^* = (X_1^*,...,X_n^*)$ is obtained from $\hat P$, i.e., $X_i^*$'s are
i.i.d. from $F_n$. Some other bootstrap sampling procedures are described in
Exercises 92 and 95-97. Let $\theta$ be a parameter of interest, $\hat\theta_n$ be an estimator
of $\theta$, and $\hat\theta_n^*$ be the bootstrap analogue of $\hat\theta_n$, i.e., $\hat\theta_n^*$ is the same as $\hat\theta_n$ except
that $X$ is replaced by the bootstrap sample $X^*$.


7.4.1 Construction of bootstrap confidence intervals 


We now introduce several different ways of constructing bootstrap confidence
intervals for a real-valued $\theta$. Some ideas can be extended to the
construction of bootstrap confidence sets for multivariate $\theta$. We mainly
consider lower confidence bounds. Upper confidence bounds can be similarly
obtained and equal-tail two-sided confidence intervals can be obtained
using confidence bounds.


The bootstrap percentile 


Define
$$K_B(x) = P_*(\hat\theta_n^* \le x), \tag{7.27}$$
where $P_*$ denotes the distribution of $X^*$ conditional on $X$. For a given
$\alpha \in (0, \frac{1}{2})$, the bootstrap percentile method (Efron, 1981) gives the following
lower confidence bound for $\theta$:

$$\underline{\theta}_{BP} = K_B^{-1}(\alpha). \tag{7.28}$$

The name percentile comes from the fact that $K_B^{-1}(\alpha)$ is a percentile of the
bootstrap distribution $K_B$ in (7.27).


For most cases, the computation of $\underline{\theta}_{BP}$ requires numerical approximations
such as the Monte Carlo approximation described in §5.5.3. A
description of the Monte Carlo approximation to bootstrap confidence sets
can be found in §7.4.3 (when bootstrap prepivoting is discussed).
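
A Monte Carlo approximation to (7.28) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper names, the number of replicates `B`, the seed, and the small data set are all assumptions made here for the example.

```python
import math
import random


def empirical_quantile(values, t):
    """Smallest v in values whose empirical c.d.f. is at least t."""
    s = sorted(values)
    k = max(1, math.ceil(t * len(s) - 1e-9))  # tolerance guards float round-off
    return s[min(k, len(s)) - 1]


def bootstrap_replicates(x, estimator, B, rng):
    """Draw B resamples of size n from the empirical c.d.f. and re-estimate."""
    n = len(x)
    return [estimator([x[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]


def percentile_lower_bound(x, estimator, alpha, B=2000, seed=0):
    """Monte Carlo approximation of K_B^{-1}(alpha), as in (7.28)."""
    reps = bootstrap_replicates(x, estimator, B, random.Random(seed))
    return empirical_quantile(reps, alpha)


def mean(v):
    return sum(v) / len(v)


# Hypothetical data, used only to exercise the functions above.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
lower = percentile_lower_bound(data, mean, alpha=0.05)
```

Only resampling and sorting are needed; no variance estimator or analytic derivation enters the computation.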


We now provide a justification of the bootstrap percentile method that
allows us to see what assumptions are required for a good performance of a
bootstrap percentile confidence set. Suppose that there exists an increasing
transformation $\phi_n(x)$ such that

$$P(\hat\phi_n - \phi_n(\theta) \le x) = \Psi(x) \tag{7.29}$$

holds for all possible $F$ (including $F = F_n$), where $\hat\phi_n = \phi_n(\hat\theta_n)$ and $\Psi$
is a c.d.f. that is continuous, strictly increasing, and symmetric about 0.
When $\Psi = \Phi$, the standard normal distribution, the function $\phi_n$ is called
the normalizing and variance stabilizing transformation. If $\phi_n$ and $\Psi$ in
(7.29) can be derived, then the following lower confidence bound for $\theta$ has
confidence coefficient $1 - \alpha$:

$$\underline{\theta}_E = \phi_n^{-1}(\hat\phi_n + z_\alpha),$$

where $z_\alpha = \Psi^{-1}(\alpha)$.

We now show that $\underline{\theta}_{BP} = \underline{\theta}_E$ and, therefore, we can still use this lower
confidence bound without deriving $\phi_n$ and $\Psi$. Let $w_n = \phi_n(\underline{\theta}_{BP}) - \hat\phi_n$.
From the fact that assumption (7.29) holds when $F$ is replaced by $F_n$,

$$\Psi(w_n) = P_*(\hat\phi_n^* - \hat\phi_n \le w_n) = P_*(\hat\theta_n^* \le \underline{\theta}_{BP}) = \alpha,$$

where $\hat\phi_n^* = \phi_n(\hat\theta_n^*)$ and the last equality follows from the definition of $\underline{\theta}_{BP}$
and the assumption on $\Psi$. Hence $w_n = z_\alpha = \Psi^{-1}(\alpha)$ and

$$\underline{\theta}_{BP} = \phi_n^{-1}(\hat\phi_n + z_\alpha) = \underline{\theta}_E.$$

Thus, the bootstrap percentile lower confidence bound $\underline{\theta}_{BP}$ has confidence
coefficient $1 - \alpha$ for all $n$ if assumption (7.29) holds exactly for all
$n$. If assumption (7.29) holds approximately for large $n$, then $\underline{\theta}_{BP}$ is $1 - \alpha$
asymptotically correct (see Theorem 7.9 in §7.4.2) and its performance
depends on how good the approximation is.


The bootstrap bias-corrected percentile 


Efron (1981) considered the following assumption that is more general than
assumption (7.29):

$$P(\hat\phi_n - \phi_n(\theta) + z_0 \le x) = \Psi(x), \tag{7.30}$$

where $\phi_n$ and $\Psi$ are the same as those in (7.29) and $z_0$ is a constant that
may depend on $F$ and $n$. When $z_0 = 0$, (7.30) reduces to (7.29). Since
$\Psi(0) = \frac{1}{2}$, $z_0$ is a kind of "bias" of $\hat\phi_n$. If $\phi_n$, $z_0$, and $\Psi$ in (7.30) can
be derived, then a lower confidence bound for $\theta$ with confidence coefficient
$1 - \alpha$ is

$$\underline{\theta}_E = \phi_n^{-1}(\hat\phi_n + z_\alpha + z_0).$$

Applying assumption (7.30) to $F = F_n$, we obtain that
$$K_B(\hat\theta_n) = P_*(\hat\phi_n^* - \hat\phi_n + z_0 \le z_0) = \Psi(z_0),$$
where $K_B$ is given in (7.27). This implies

$$z_0 = \Psi^{-1}(K_B(\hat\theta_n)). \tag{7.31}$$

Also from (7.30),

$$\begin{aligned} 1 - \alpha &= \Psi(-z_\alpha) \\ &= P_*(\hat\phi_n^* - \hat\phi_n + z_0 \le -z_\alpha) \\ &= P_*(\hat\theta_n^* \le \phi_n^{-1}(\hat\phi_n - z_\alpha - z_0)), \end{aligned}$$

which implies
$$\phi_n^{-1}(\hat\phi_n - z_\alpha - z_0) = K_B^{-1}(1 - \alpha).$$
Since this equation holds for any $\alpha$, it implies that for $0 < x < 1$,

$$K_B^{-1}(x) = \phi_n^{-1}(\hat\phi_n + \Psi^{-1}(x) - z_0). \tag{7.32}$$
By the definition of $\underline{\theta}_E$ and (7.32),
$$\underline{\theta}_E = K_B^{-1}(\Psi(z_\alpha + 2z_0)).$$

Assuming that $\Psi$ is known (e.g., $\Psi = \Phi$) and using (7.31), Efron (1981)
obtained the bootstrap bias-corrected (BC) percentile lower confidence bound
for $\theta$:

$$\underline{\theta}_{BC} = K_B^{-1}\big(\Psi(z_\alpha + 2\Psi^{-1}(K_B(\hat\theta_n)))\big), \tag{7.33}$$
which is a percentile of the bootstrap distribution $K_B$. Note that $\underline{\theta}_{BC}$
reduces to $\underline{\theta}_{BP}$ if $K_B(\hat\theta_n) = \frac{1}{2}$, i.e., $\hat\theta_n$ is the median of the bootstrap
distribution $K_B$. Hence, the bootstrap BC percentile method is a bias-corrected
version of the bootstrap percentile method and the bias-correction
is represented by $2\Psi^{-1}(K_B(\hat\theta_n))$. If (7.30) holds exactly, then $\underline{\theta}_{BC}$ has
confidence coefficient $1 - \alpha$ for all $n$. If (7.30) holds approximately, then
$\underline{\theta}_{BC}$ is $1 - \alpha$ asymptotically correct.


The bootstrap BC percentile method improves the bootstrap percentile
method by taking a bias into account. This is supported by the theoretical
result in §7.4.2. However, there are still many cases where assumption (7.30)
cannot be fulfilled nicely and the bootstrap BC percentile method does not
work well. Efron (1987) proposed a bootstrap accelerated bias-corrected
(BC$_a$) percentile method (see Exercise 93) that improves the bootstrap BC
percentile method. However, applications of the bootstrap BC$_a$ percentile
method involve some derivations that may be very complicated. See Efron
(1987) and Efron and Tibshirani (1993) for details.
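
The bound (7.33) with $\Psi = \Phi$ can be computed directly from a list of bootstrap replicates of $\hat\theta_n^*$. The sketch below assumes such a list is already available (for instance from a Monte Carlo run); the function names and the clamping of $K_B(\hat\theta_n)$ away from 0 and 1 are choices made here for illustration, not part of the method's definition.

```python
import math
from statistics import NormalDist


def empirical_quantile(values, t):
    """Smallest v in values whose empirical c.d.f. is at least t."""
    s = sorted(values)
    k = max(1, math.ceil(t * len(s) - 1e-9))  # tolerance guards float round-off
    return s[min(k, len(s)) - 1]


def bc_percentile_lower_bound(reps, theta_hat, alpha):
    """Bound (7.33) with Psi = Phi, computed from bootstrap replicates."""
    std_normal = NormalDist()
    B = len(reps)
    # K_B(theta_hat): proportion of replicates not exceeding the estimate.
    k_at_estimate = sum(r <= theta_hat for r in reps) / B
    # Keep the proportion strictly inside (0, 1) so inv_cdf is defined.
    k_at_estimate = min(max(k_at_estimate, 1 / (B + 1)), B / (B + 1))
    z0 = std_normal.inv_cdf(k_at_estimate)  # bias constant, as in (7.31)
    level = std_normal.cdf(std_normal.inv_cdf(alpha) + 2 * z0)
    return empirical_quantile(reps, level)
```

When the estimate is the median of the replicates, `z0` is zero and the result equals the percentile bound $K_B^{-1}(\alpha)$; otherwise the percentile level is shifted by the bias correction.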


The hybrid bootstrap 


Suppose that $\hat\theta_n$ is asymptotically normal, i.e., (7.18) holds with $V_n = \sigma_F^2/n$.
Let $H_n$ be the c.d.f. of $\sqrt{n}(\hat\theta_n - \theta)$ and

$$H_B(x) = P_*(\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \le x)$$

be its bootstrap estimator defined in (5.121). From the results in Theorem
5.20, for any $t \in (0,1)$, $H_B^{-1}(t) - H_n^{-1}(t) \to_p 0$. Treating the quantile of
$H_B$ as the quantile of $H_n$, we obtain the following hybrid bootstrap lower
confidence bound for $\theta$:

$$\underline{\theta}_{HB} = \hat\theta_n - n^{-1/2} H_B^{-1}(1 - \alpha). \tag{7.34}$$
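
Given bootstrap replicates of $\hat\theta_n^*$, the bound (7.34) amounts to reflecting an upper quantile of the replicates about $\hat\theta_n$. The following sketch assumes the replicates are supplied by the caller; the names are hypothetical.

```python
import math


def empirical_quantile(values, t):
    """Smallest v in values whose empirical c.d.f. is at least t."""
    s = sorted(values)
    k = max(1, math.ceil(t * len(s) - 1e-9))  # tolerance guards float round-off
    return s[min(k, len(s)) - 1]


def hybrid_lower_bound(reps, theta_hat, n, alpha):
    """Bound (7.34): theta_hat - n^{-1/2} * H_B^{-1}(1 - alpha)."""
    root_n = math.sqrt(n)
    # Monte Carlo draws from H_B, the bootstrap law of sqrt(n)(theta* - theta_hat).
    h_draws = [root_n * (r - theta_hat) for r in reps]
    return theta_hat - empirical_quantile(h_draws, 1 - alpha) / root_n
```

Note the use of the $(1-\alpha)$th quantile with a minus sign, in contrast to the percentile method, which uses the $\alpha$th quantile of the replicates themselves.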


The bootstrap-t 


Suppose that (7.18) holds with $V_n = \sigma_F^2/n$ and $\hat\sigma_F^2$ is a consistent estimator
of $\sigma_F^2$. The bootstrap-t method is based on $t(X, \theta) = \sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_F$,
which is often called a studentized "statistic". If the distribution $G_n$ of
$t(X, \theta)$ is known (i.e., $t(X, \theta)$ is pivotal), then a confidence interval for $\theta$
with confidence coefficient $1 - \alpha$ can be obtained (§7.1.1). If $G_n$ is unknown,
it can be estimated by the bootstrap estimator

$$G_B(x) = P_*(t(X^*, \hat\theta_n) \le x),$$

where $t(X^*, \hat\theta_n) = \sqrt{n}(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_F^*$ and $\hat\sigma_F^*$ is the bootstrap analogue of $\hat\sigma_F$.
Treating the quantile of $G_B$ as the quantile of $G_n$, we obtain the following
bootstrap-t lower confidence bound for $\theta$:

$$\underline{\theta}_{BT} = \hat\theta_n - n^{-1/2}\hat\sigma_F G_B^{-1}(1 - \alpha). \tag{7.35}$$

Although it is shown in §7.4.2 that $\underline{\theta}_{BT}$ in (7.35) is more accurate than
$\underline{\theta}_{BP}$ in (7.28), $\underline{\theta}_{BC}$ in (7.33), and $\underline{\theta}_{HB}$ in (7.34), the use of the bootstrap-t
method requires a consistent variance estimator $\hat\sigma_F^2$.
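
For the sample mean, where $\hat\sigma_F$ can be taken to be the sample standard deviation, a Monte Carlo version of (7.35) might look as follows. This is a sketch under that assumption; the function names, `B`, the seed, and the rule of skipping degenerate resamples are illustrative choices.

```python
import math
import random


def empirical_quantile(values, t):
    """Smallest v in values whose empirical c.d.f. is at least t."""
    s = sorted(values)
    k = max(1, math.ceil(t * len(s) - 1e-9))  # tolerance guards float round-off
    return s[min(k, len(s)) - 1]


def mean_and_sd(v):
    m = sum(v) / len(v)
    return m, math.sqrt(sum((vi - m) ** 2 for vi in v) / len(v))


def bootstrap_t_lower_bound(x, alpha, B=2000, seed=0):
    """Bound (7.35) for the mean, with G_B approximated by Monte Carlo."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat, sigma_hat = mean_and_sd(x)
    t_draws = []
    for _ in range(B):
        resample = [x[rng.randrange(n)] for _ in range(n)]
        theta_star, sigma_star = mean_and_sd(resample)
        if sigma_star > 0:  # skip degenerate resamples with zero spread
            t_draws.append(math.sqrt(n) * (theta_star - theta_hat) / sigma_star)
    quantile = empirical_quantile(t_draws, 1 - alpha)
    return theta_hat - sigma_hat * quantile / math.sqrt(n)
```

Each resample needs its own variance estimate $\hat\sigma_F^*$, which is the extra requirement of the bootstrap-t relative to the percentile and hybrid methods.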


7.4.2 Asymptotic correctness and accuracy 


From the construction of the hybrid bootstrap and bootstrap-t confidence
bounds, $\underline{\theta}_{HB}$ is $1 - \alpha$ asymptotically correct if $\varrho_\infty(H_B, H_n) \to_p 0$, and $\underline{\theta}_{BT}$
is $1 - \alpha$ asymptotically correct if $\varrho_\infty(G_B, G_n) \to_p 0$. On the other hand,
the asymptotic correctness of the bootstrap percentile (with or without
bias-correction or acceleration) confidence bounds requires slightly more.


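Theorem 7.9. Suppose that $\varrho_\infty(H_B, H_n) \to_p 0$ and

$$\lim_{n \to \infty} \varrho_\infty(H_n, H) = 0, \tag{7.36}$$

where $H$ is a c.d.f. on $\mathcal{R}$ that is continuous, strictly increasing, and symmetric
about 0. Then $\underline{\theta}_{BP}$ in (7.28) and $\underline{\theta}_{BC}$ in (7.33) are $1 - \alpha$ asymptotically
correct.
Proof. The result for $\underline{\theta}_{BP}$ follows from

$$\begin{aligned} P(\underline{\theta}_{BP} \le \theta) &= P(\sqrt{n}(\hat\theta_n - \theta) \le -H_B^{-1}(\alpha)) \\ &= P(\sqrt{n}(\hat\theta_n - \theta) \le -H^{-1}(\alpha)) + o(1) \\ &= H(-H^{-1}(\alpha)) + o(1) \\ &= 1 - \alpha + o(1). \end{aligned}$$

The result for $\underline{\theta}_{BC}$ follows from the previous result and

$$z_0 = \Psi^{-1}(K_B(\hat\theta_n)) = \Psi^{-1}(H_B(0)) \to_p \Psi^{-1}(H(0)) = 0. \qquad ∎$$

Theorem 7.9 can be obviously extended to the case of upper confidence
bounds or two-sided confidence intervals. The result also holds for the
bootstrap BC$_a$ percentile confidence intervals.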


Note that $H$ in (7.36) is not the same as $\Psi$ in assumption (7.30). Usually
$H(x) = \Phi(x/\sigma_F)$ for some $\sigma_F > 0$, whereas $\Psi = \Phi$. Also, condition (7.36)
is much weaker than assumption (7.30), since the latter requires variance
stabilizing.

It is not surprising that all bootstrap methods introduced in §7.4.1 produce
asymptotically correct confidence sets. To compare various bootstrap
confidence intervals and other asymptotic confidence intervals, we now
consider their asymptotic accuracy (Definition 7.5).


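Consider the case of $\theta = g(\mu)$, $\mu = EX_1$, and $\hat\theta_n = g(\bar X)$, where $\bar X$ is the
sample mean and $g$ is five times continuously differentiable from $\mathcal{R}^d$ to $\mathcal{R}$
with $\nabla g(\mu) \ne 0$. The asymptotic variance of $\sqrt{n}(\hat\theta_n - \theta)$ can be estimated
by $\hat\sigma_F^2 = \frac{n-1}{n}[\nabla g(\bar X)]^\tau S^2 \nabla g(\bar X)$, where $S^2$ is the sample covariance matrix.
Let $G_n$ be the distribution of $\sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_F$. If $G_n$ is known, then a lower
confidence bound for $\theta$ with confidence coefficient $1 - \alpha$ is

$$\underline{\theta}_E = \hat\theta_n - n^{-1/2}\hat\sigma_F G_n^{-1}(1 - \alpha), \tag{7.37}$$

which is not useful if $G_n$ is unknown.

Assume that $E\|X_1\|^8 < \infty$ and condition (1.105) is satisfied. Then $G_n$
admits the Edgeworth expansion (1.106) with $m = 2$ and, by Theorem 1.17,
$G_n^{-1}(t)$ admits the Cornish-Fisher expansion

$$G_n^{-1}(t) = z_t + \frac{q_1(z_t, F)}{\sqrt{n}} + \frac{q_2(z_t, F)}{n} + o\Big(\frac{1}{n}\Big), \tag{7.38}$$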


where $q_j(\cdot, F)$ is the same as $q_j(\cdot)$ in Theorem 1.17 but the notation $q_j(\cdot, F)$
is used to emphasize that the coefficients of the polynomial $q_j$ depend on $F$,
the c.d.f. of $X_1$. Let $G_B$ be the bootstrap estimator of $G_n$ defined in §7.4.1.
Under some conditions (Hall, 1992), $G_B^{-1}$ admits expansion (7.38) with
$F$ replaced by the empirical c.d.f. $F_n$ for almost all sequences $X_1, X_2, ....$
Hence the bootstrap-t lower confidence bound in (7.35) can be written as

$$\underline{\theta}_{BT} = \hat\theta_n - \frac{\hat\sigma_F}{\sqrt{n}}\Big[z_{1-\alpha} + \sum_{j=1}^{2} \frac{q_j(z_{1-\alpha}, F_n)}{n^{j/2}} + o\Big(\frac{1}{n}\Big)\Big] \quad \text{a.s.} \tag{7.39}$$

Under some moment conditions, $q_j(x, F_n) - q_j(x, F) = O_p(n^{-1/2})$ for each
$x$, $j = 1, 2$. Then, comparing (7.37), (7.38), and (7.39), we obtain that

$$\underline{\theta}_{BT} - \underline{\theta}_E = O_p(n^{-3/2}). \tag{7.40}$$
Furthermore,
$$\begin{aligned} P(\underline{\theta}_{BT} \le \theta) &= P\Big(\frac{\sqrt{n}(\hat\theta_n - \theta)}{\hat\sigma_F} \le z_{1-\alpha} + \sum_{j=1}^{2} \frac{q_j(z_{1-\alpha}, F_n)}{n^{j/2}} + o\Big(\frac{1}{n}\Big)\Big) \\ &= 1 - \alpha + \frac{\psi(z_{1-\alpha})\Phi'(z_{1-\alpha})}{n} + o\Big(\frac{1}{n}\Big), \end{aligned} \tag{7.41}$$

where $\psi(x)$ is a polynomial whose coefficients are functions of moments of $F$
and the last equality can be justified by a somewhat complicated argument
(Hall, 1992).

Result (7.41) implies that $\underline{\theta}_{BT}$ is second-order accurate according to
Definition 7.5. The same can be concluded for the bootstrap-t upper confidence
bound and the equal-tail two-sided bootstrap-t confidence interval
for $\theta$.

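Next, we consider the hybrid bootstrap lower confidence bound $\underline{\theta}_{HB}$
given by (7.34). Let $\tilde H_B$ be the bootstrap estimator of $\tilde H_n$, the distribution
of $\sqrt{n}(\hat\theta_n - \theta)/\sigma_F$. Then $H_B^{-1}(1 - \alpha) = \hat\sigma_F \tilde H_B^{-1}(1 - \alpha)$ and

$$\underline{\theta}_{HB} = \hat\theta_n - n^{-1/2}\hat\sigma_F \tilde H_B^{-1}(1 - \alpha),$$
which can be viewed as a bootstrap approximation to
$$\underline{\theta}_H = \hat\theta_n - n^{-1/2}\hat\sigma_F \tilde H_n^{-1}(1 - \alpha).$$

Note that $\underline{\theta}_H$ does not have confidence coefficient $1 - \alpha$, since it is obtained
by muddling up $G_n^{-1}(1 - \alpha)$ and $\tilde H_n^{-1}(1 - \alpha)$. Similar to $G_n^{-1}$, $\tilde H_n^{-1}$ admits
the Cornish-Fisher expansion
$$\tilde H_n^{-1}(t) = z_t + \frac{\tilde q_1(z_t, F)}{\sqrt{n}} + \frac{\tilde q_2(z_t, F)}{n} + o\Big(\frac{1}{n}\Big) \tag{7.42}$$

and $\tilde H_B^{-1}$ admits the same expansion (7.42) with $F$ replaced by $F_n$ for
almost all $X_1, X_2, ....$ Then

$$\underline{\theta}_{HB} = \hat\theta_n - \frac{\hat\sigma_F}{\sqrt{n}}\Big[z_{1-\alpha} + \sum_{j=1}^{2} \frac{\tilde q_j(z_{1-\alpha}, F_n)}{n^{j/2}} + o\Big(\frac{1}{n}\Big)\Big] \quad \text{a.s.} \tag{7.43}$$

and, by (7.37),

$$\underline{\theta}_{HB} - \underline{\theta}_E = O_p(n^{-1}), \tag{7.44}$$
since $q_1(x, F)$ and $\tilde q_1(x, F)$ are usually different. Results (7.40) and (7.44)
imply that $\underline{\theta}_{HB}$ is not as close to $\underline{\theta}_E$ as $\underline{\theta}_{BT}$. Similarly to (7.41), we can
show that (Hall, 1992)

$$\begin{aligned} P(\underline{\theta}_{HB} \le \theta) &= P\Big(\frac{\sqrt{n}(\hat\theta_n - \theta)}{\hat\sigma_F} \le z_{1-\alpha} + \sum_{j=1}^{2} \frac{\tilde q_j(z_{1-\alpha}, F_n)}{n^{j/2}} + o\Big(\frac{1}{n}\Big)\Big) \\ &= P\Big(\frac{\sqrt{n}(\hat\theta_n - \theta)}{\hat\sigma_F} \le z_{1-\alpha} + \sum_{j=1}^{2} \frac{\tilde q_j(z_{1-\alpha}, F)}{n^{j/2}}\Big) + O\Big(\frac{1}{n}\Big) \\ &= 1 - \alpha + \frac{\tilde\psi(z_{1-\alpha})\Phi'(z_{1-\alpha})}{\sqrt{n}} + O\Big(\frac{1}{n}\Big), \end{aligned} \tag{7.45}$$

where $\tilde\psi(x)$ is an even polynomial. This implies that when $\tilde\psi \not\equiv 0$, $\underline{\theta}_{HB}$ is
only first-order accurate according to Definition 7.5.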


The same conclusion can be drawn for the hybrid bootstrap upper confidence
bounds. However, the equal-tail two-sided hybrid bootstrap confidence
interval

$$[\underline{\theta}_{HB}, \overline{\theta}_{HB}] = [\hat\theta_n - n^{-1/2}H_B^{-1}(1 - \alpha),\ \hat\theta_n - n^{-1/2}H_B^{-1}(\alpha)]$$

is second-order accurate (and $1 - 2\alpha$ asymptotically correct), as is the
equal-tail two-sided bootstrap-t confidence interval, since

$$\begin{aligned} P(\underline{\theta}_{HB} \le \theta \le \overline{\theta}_{HB}) &= P(\theta \le \overline{\theta}_{HB}) - P(\theta < \underline{\theta}_{HB}) \\ &= 1 - \alpha + n^{-1/2}\tilde\psi(z_{1-\alpha})\Phi'(z_{1-\alpha}) \\ &\quad - \alpha - n^{-1/2}\tilde\psi(z_\alpha)\Phi'(z_\alpha) + O(n^{-1}) \\ &= 1 - 2\alpha + O(n^{-1}) \end{aligned}$$

by the fact that $\tilde\psi$ and $\Phi'$ are even functions and $z_{1-\alpha} = -z_\alpha$.

For the bootstrap percentile lower confidence bound in (7.28),

$$\underline{\theta}_{BP} = K_B^{-1}(\alpha) = \hat\theta_n + n^{-1/2}H_B^{-1}(\alpha).$$

Comparing $\underline{\theta}_{BP}$ with $\underline{\theta}_{BT}$ and $\underline{\theta}_{HB}$, we find that the bootstrap percentile
method muddles up not only $H_B^{-1}(1 - \alpha)$ and $G_B^{-1}(1 - \alpha)$, but also $H_B^{-1}(\alpha)$
and $-H_B^{-1}(1 - \alpha)$. If $H_B$ is asymptotically symmetric about 0, then the
bootstrap percentile method is equivalent to the hybrid bootstrap method
and, therefore, one-sided bootstrap percentile confidence intervals are only
first-order accurate and the equal-tail two-sided bootstrap percentile confidence
interval is second-order accurate.

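Since $\hat\theta_n$ is asymptotically normal, we can use $\Psi = \Phi$ for the bootstrap
BC percentile method. Let $\hat\alpha_n = \Phi(z_\alpha + 2\hat z_0)$, where $\hat z_0 = \Phi^{-1}(K_B(\hat\theta_n))$ as in
(7.31). Then the bootstrap BC percentile lower confidence bound given by
(7.33) is just the $\hat\alpha_n$th quantile of $K_B$ in (7.27). Using the Edgeworth
expansion, we obtain that

$$K_B(\hat\theta_n) = \tilde H_B(0) = \Phi(0) + \frac{\tilde q(0, F_n)\Phi'(0)}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big)$$

with some function $\tilde q$ and, therefore,

$$\hat\alpha_n = \alpha + \frac{2\tilde q(0, F_n)\Phi'(z_\alpha)}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big).$$

This result and the Cornish-Fisher expansion for $\tilde H_B^{-1}$ imply

$$\tilde H_B^{-1}(\hat\alpha_n) = z_\alpha + \frac{2\tilde q(0, F_n) + \tilde q_1(z_\alpha, F_n)}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big).$$

Then from (7.33) and $K_B^{-1}(\hat\alpha_n) = \hat\theta_n + n^{-1/2}\hat\sigma_F \tilde H_B^{-1}(\hat\alpha_n)$,

$$\underline{\theta}_{BC} = \hat\theta_n + \frac{\hat\sigma_F}{\sqrt{n}}\Big[z_\alpha + \frac{2\tilde q(0, F_n) + \tilde q_1(z_\alpha, F_n)}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big)\Big]. \tag{7.46}$$
Comparing (7.37) with (7.46), we conclude that
$$\underline{\theta}_{BC} - \underline{\theta}_E = O_p(n^{-1}).$$
It also follows from (7.46) that
$$P(\underline{\theta}_{BC} \le \theta) = 1 - \alpha + \frac{\bar\psi(z_{1-\alpha})\Phi'(z_{1-\alpha})}{\sqrt{n}} + O\Big(\frac{1}{n}\Big) \tag{7.47}$$

with an even polynomial $\bar\psi(x)$. Hence $\underline{\theta}_{BC}$ is first-order accurate in general.
In fact,
$$\underline{\theta}_{BC} - \underline{\theta}_{BP} = 2\tilde q(0, F_n)\hat\sigma_F n^{-1} + O_p(n^{-3/2})$$

and, therefore, the bootstrap BC percentile and the bootstrap percentile
confidence intervals have the same order of accuracy. The bootstrap BC
percentile method, however, is a partial improvement over the bootstrap
percentile method in the sense that the absolute value of $\bar\psi(z_{1-\alpha})$ in (7.47)
is smaller than that of $\tilde\psi(z_{1-\alpha})$ in (7.45) (see Example 7.26).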


While the bootstrap BC percentile method does not improve the bootstrap
percentile method in terms of accuracy order, Hall (1988) showed that
the bootstrap BC$_a$ percentile method in Efron (1987) produces second-order
accurate one-sided and two-sided confidence intervals and that (7.40) holds
with $\underline{\theta}_{BT}$ replaced by the bootstrap BC$_a$ percentile lower confidence bound.


We have considered the order of asymptotic accuracy for all bootstrap
confidence intervals introduced in §7.4.1. In summary, all two-sided confidence
intervals are second-order accurate; the one-sided bootstrap-t and
bootstrap BC$_a$ percentile confidence intervals are second-order accurate,
whereas the one-sided bootstrap percentile, bootstrap BC percentile, and
hybrid bootstrap confidence intervals are first-order accurate; however, the
latter three are simpler than the former two.


Note that the results in §7.3.4 show that asymptotic confidence intervals 
obtained using the method in §7.3.1 have the same order of accuracy as the 
hybrid bootstrap confidence intervals. 


Example 7.26. Suppose that $d = 1$ and $g(x) = x$. It follows from the
results in §1.5.6 that expansions (7.38) and (7.42) hold with $q_1(x, F) =
-\gamma(2x^2 + 1)/6$, $q_2(x, F) = x[(x^2 + 3)/4 - \kappa(x^2 - 3)/12 + 5\gamma^2(4x^2 - 1)/72]$,
$\tilde q_1(x, F) = \gamma(x^2 - 1)/6$, and $\tilde q_2(x, F) = x[\kappa(x^2 - 3)/24 - \gamma^2(2x^2 - 5)/36]$,
where $\gamma = \kappa_3 = E(X_1 - \mu)^3/\sigma^3$ (skewness), $\kappa = E(X_1 - \mu)^4/\sigma^4 - 3$
(kurtosis), and $\sigma^2 = \mathrm{Var}(X_1)$.

The function $\psi$ in (7.41) is equal to $x(1 + 2x^2)(\kappa - 3\gamma^2/2)/6$; the function
$\tilde\psi$ in (7.45) is equal to $\gamma x^2/2$; and the function $\bar\psi(x)$ in (7.47) is equal to
$\gamma(x^2 + 2)/6$ (see Liu and Singh (1987)). If $\gamma \ne 0$, then $\underline{\theta}_{HB}$, $\underline{\theta}_{BC}$, and the
asymptotic lower confidence bound $\underline{\theta}_N = \hat\theta_n - n^{-1/2}\hat\sigma_F z_{1-\alpha}$ are first-order
accurate. In this example, we can still compare their relative performances
in terms of the convergence speed of the coverage probability. Let

$$e(\underline{\theta}) = P(\underline{\theta} \le \theta) - (1 - \alpha)$$

be the error in coverage probability for the lower confidence bound $\underline{\theta}$. It
can be shown (exercise) that

$$|e(\underline{\theta}_{HB})| = |e(\underline{\theta}_N)| + C_n(z_\alpha, F) + o(n^{-1/2}) \tag{7.48}$$

and
$$|e(\underline{\theta}_N)| = |e(\underline{\theta}_{BC})| + C_n(z_\alpha, F) + o(n^{-1/2}), \tag{7.49}$$

where $C_n(x, F) = |\gamma|(x^2 - 1)\Phi'(x)/(6\sqrt{n})$. Assume $\gamma \ne 0$. When $z_\alpha^2 > 1$,
which is usually the case in practice, $C_n(z_\alpha, F) > 0$ and, therefore, $\underline{\theta}_{BC}$ is
better than $\underline{\theta}_N$, which is better than $\underline{\theta}_{HB}$. The use of $\underline{\theta}_N$ requires a variance
estimator $\hat\sigma_F^2$, which is not required by the bootstrap BC percentile and
hybrid bootstrap methods. When a variance estimator is available, we can
use the bootstrap-t lower confidence bound, which is second-order accurate
even when $\gamma \ne 0$. ∎
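
To get a feel for the size of the gap $C_n(z_\alpha, F) = |\gamma|(z_\alpha^2 - 1)\Phi'(z_\alpha)/(6\sqrt{n})$ in (7.48) and (7.49), it can be evaluated numerically. The function name below is hypothetical and the inputs are illustrative; only the formula itself comes from the example.

```python
import math
from statistics import NormalDist


def coverage_gap(gamma, n, alpha):
    """C_n(z_alpha, F) = |gamma| (z_alpha^2 - 1) Phi'(z_alpha) / (6 sqrt(n))."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(alpha)
    return abs(gamma) * (z * z - 1) * std_normal.pdf(z) / (6 * math.sqrt(n))


# Skewness 2 (the value for an exponential distribution), two sample sizes.
gap_n20 = coverage_gap(2.0, 20, 0.05)
gap_n80 = coverage_gap(2.0, 80, 0.05)
```

The gap is positive whenever $z_\alpha^2 > 1$ and shrinks at the rate $n^{-1/2}$, so quadrupling $n$ halves it.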


7.4.3 High-order accurate bootstrap confidence sets 


The discussion in §7.3.4 shows how to derive second-order accurate confi- 
dence bounds. Hall (1992) showed how to obtain higher order accurate con- 
fidence sets using higher order Edgeworth and Cornish-Fisher expansions. 
However, the theoretical derivation of these high order accurate confidence 
sets may be very complicated (see Example 7.25). The bootstrap method 
can be used to obtain second-order or higher order accurate confidence sets 
without requiring any theoretical derivation but requiring some extensive 
computations. 


The bootstrap prepivoting and bootstrap inverting 


The hybrid bootstrap and the bootstrap-t are based on the bootstrap distribution
estimators for $\sqrt{n}(\hat\theta_n - \theta)$ and $\sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_F$, respectively. Beran
(1987) argued that the reason why the bootstrap-t is better than the hybrid
bootstrap is that $\sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_F$ is more pivotal than $\sqrt{n}(\hat\theta_n - \theta)$ in the
sense that the distribution of $\sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_F$ is less dependent on the
unknown $F$. The bootstrap-t method, however, requires a variance estimator
$\hat\sigma_F^2$. Beran (1987) suggested the following method called bootstrap prepivoting.
Let $\Re_n^{(0)}$ be a random function (such as $\sqrt{n}(\hat\theta_n - \theta)$ or $\sqrt{n}(\hat\theta_n - \theta)/\hat\sigma_F$)
used to construct a confidence set for $\theta \in \mathcal{R}^k$, $H_n^{(0)}$ be the distribution of
$\Re_n^{(0)}$, and let $H_B^{(0)}$ be the bootstrap estimator of $H_n^{(0)}$. Define

$$\Re_n^{(1)} = H_B^{(0)}(\Re_n^{(0)}). \tag{7.50}$$

If $H_n^{(0)}$ is continuous and if we replace $H_B^{(0)}$ in (7.50) by $H_n^{(0)}$, then $\Re_n^{(1)}$
has the uniform distribution $U(0,1)$. Hence, it is expected that $\Re_n^{(1)}$ is
more pivotal than $\Re_n^{(0)}$. Let $H_B^{(1)}$ be the bootstrap estimator of $H_n^{(1)}$, the
distribution of $\Re_n^{(1)}$. Then $\Re_n^{(2)} = H_B^{(1)}(\Re_n^{(1)})$ is more pivotal than $\Re_n^{(1)}$.
In general, let $H_n^{(j)}$ be the distribution of $\Re_n^{(j)}$ and $H_B^{(j)}$ be the bootstrap
estimator of $H_n^{(j)}$, $j = 0, 1, 2, ....$ Then we can use the following confidence
sets for $\theta$:

$$C_{PREB}^{(j)}(X) = \{\theta : \Re_n^{(j)} \le (H_B^{(j)})^{-1}(1 - \alpha)\}, \quad j = 0, 1, 2, .... \tag{7.51}$$


Note that for each $j$, $C_{PREB}^{(j)}(X)$ is a hybrid bootstrap confidence set based
on $\Re_n^{(j)}$. Since $\Re_n^{(j+1)}$ is more pivotal than $\Re_n^{(j)}$, we obtain a sequence of
confidence sets with increasing accuracies. Beran (1987) showed that if
the distribution of $\sqrt{n}(\hat\theta_n - \theta)$ has a two-term Edgeworth expansion, then
the one-sided confidence interval $C_{PREB}^{(1)}(X)$ based on $\Re_n^{(0)} = \sqrt{n}(\hat\theta_n - \theta)$
is second-order accurate, and the two-sided confidence interval $C_{PREB}^{(1)}(X)$
based on $\Re_n^{(0)} = \sqrt{n}|\hat\theta_n - \theta|$ is third-order accurate. Hence, bootstrap
prepivoting with one iteration improves the hybrid bootstrap method. It is
expected that the one-sided confidence interval $C_{PREB}^{(2)}(X)$ based on $\Re_n^{(0)} =
\sqrt{n}(\hat\theta_n - \theta)$ is third-order accurate, i.e., it is better than the one-sided
bootstrap-t or bootstrap BC$_a$ percentile confidence interval. More detailed
discussion can be found in Beran (1987).


It seems that, using this iterative method, we can start with a $\Re_n^{(0)}$
and obtain a bootstrap confidence set that is as accurate as we want it
to be. However, more computations are required for higher stage bootstrapping
and, therefore, the practical implementation of this method is
very hard, or even impossible, with current computational ability. We explain
this with the computation of $C_{PREB}^{(1)}(X)$ based on $\Re_n^{(0)} = \Re_n(X, F)$.
Suppose that we use the Monte Carlo approximation. Let $\{X_{1b}^*, ..., X_{nb}^*\}$
be i.i.d. samples from $F_n$, $b = 1, ..., B_1$. Then $H_B^{(0)}$ is approximated by
$H_B^{(0,B_1)}$, the empirical distribution of $\{\Re_{nb}^{(0)*} : b = 1, ..., B_1\}$, where $\Re_{nb}^{(0)*} =
\Re_n(X_{1b}^*, ..., X_{nb}^*, F_n)$. For each $b$, let $F_{nb}^*$ be the empirical distribution of
$X_{1b}^*, ..., X_{nb}^*$, $\{X_{1bj}^{**}, ..., X_{nbj}^{**}\}$ be i.i.d. samples from $F_{nb}^*$, $j = 1, ..., B_2$, $H_b^*$
be the empirical c.d.f. of $\{\Re_n(X_{1bj}^{**}, ..., X_{nbj}^{**}, F_{nb}^*) : j = 1, ..., B_2\}$, and $z_b^* =
H_b^*(\Re_{nb}^{(0)*})$. Then $H_B^{(1)}$ can be approximated by $H_B^{(1,B_1B_2)}$, the empirical
distribution of $\{z_b^* : b = 1, ..., B_1\}$, and the confidence set $C_{PREB}^{(1)}(X)$ can
be approximated by

$$\big\{\theta : \Re_n(X, F) \le (H_B^{(0,B_1)})^{-1}\big((H_B^{(1,B_1B_2)})^{-1}(1 - \alpha)\big)\big\}.$$

The second-stage bootstrap sampling is nested in the first-stage bootstrap
sampling. Thus the total number of bootstrap data sets we need is $B_1B_2$,
which is why this method is also called the double bootstrap. If each
stage requires 1,000 bootstrap replicates, then the total number of bootstrap
replicates is 1,000,000! Similarly, to compute $C_{PREB}^{(j)}(X)$ we need
$(1{,}000)^{j+1}$ bootstrap replicates, $j = 2, 3, ...$, which limits the application of
the bootstrap prepivoting method.
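
The nested Monte Carlo scheme just described can be sketched for the root $\sqrt{n}(\bar X - \theta)$ with $\theta$ the mean. The code below is a minimal illustration under that assumption; the function names, default values of `B1` and `B2`, and the seed are choices made here, and the second return value simply counts the second-stage data sets to make the $B_1B_2$ cost visible.

```python
import math
import random


def empirical_quantile(values, t):
    """Smallest v in values whose empirical c.d.f. is at least t."""
    s = sorted(values)
    k = max(1, math.ceil(t * len(s) - 1e-9))  # tolerance guards float round-off
    return s[min(k, len(s)) - 1]


def resample(x, rng):
    n = len(x)
    return [x[rng.randrange(n)] for _ in range(n)]


def mean(v):
    return sum(v) / len(v)


def prepivoted_lower_bound(x, alpha, B1=200, B2=100, seed=0):
    """One prepivoting iteration for the root sqrt(n) * (mean(X) - theta)."""
    rng = random.Random(seed)
    n, theta_hat = len(x), mean(x)
    root_n = math.sqrt(n)
    first_stage, z_values, n_second_stage = [], [], 0
    for _ in range(B1):
        xb = resample(x, rng)                    # first-stage sample from F_n
        r_b = root_n * (mean(xb) - theta_hat)    # root evaluated at (X*_b, F_n)
        # Second stage: resample from the empirical c.d.f. of xb.
        inner = [root_n * (mean(resample(xb, rng)) - mean(xb)) for _ in range(B2)]
        n_second_stage += B2
        z_values.append(sum(r <= r_b for r in inner) / B2)  # z*_b = H*_b(r_b)
        first_stage.append(r_b)
    level = empirical_quantile(z_values, 1 - alpha)   # adjusted quantile level
    cutoff = empirical_quantile(first_stage, level)   # quantile of first-stage roots
    return theta_hat - cutoff / root_n, n_second_stage
```

Even in this small sketch the inner loop dominates the running time, which is the practical obstacle to iterating the procedure further.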

A very similar method, bootstrap inverting, is given in Hall (1992). Instead
of using (7.51), we define

$$C_{INVB}^{(j)}(X) = \{\theta : \Re_n^{(j)} \le (H_B^{(j)})^{-1}(1 - \alpha)\}, \quad j = 0, 1, 2, ...,$$

where
$$\Re_n^{(j)} = \Re_n^{(j-1)} - (H_B^{(j-1)})^{-1}(1 - \alpha), \quad j = 1, 2, ...,$$

and $H_B^{(j)}$ is the bootstrap estimator of the distribution of $\Re_n^{(j)}$. For each
$j \ge 1$, $C_{INVB}^{(j)}(X)$ and $C_{PREB}^{(j)}(X)$ in (7.51) have the same order of accuracy
and require the same amount of computation. They are special cases
of a general iterative bootstrap introduced by Hall and Martin (1988). Hall
(1992) showed that confidence sets having the same order of accuracy as
$C_{INVB}^{(j)}(X)$ can also be obtained using Edgeworth and Cornish-Fisher expansions.
Thus, the bootstrap method replaces the analytic derivation of
Edgeworth and Cornish-Fisher expansions by extensive computations.


Bootstrap calibrating 


Suppose that we want a confidence set $C(X)$ with confidence coefficient
$1 - \alpha$, which is called the nominal level. The basic idea of bootstrap calibrating
is to improve $C(X)$ by adjusting its nominal level. Let $\pi_n$ be the
actual coverage probability of $C(X)$. The value of $\pi_n$ can be estimated by
a bootstrap estimator $\hat\pi_n$. If we find that $\hat\pi_n$ is far from $1 - \alpha$, then we
construct a confidence set $C_1(X)$ with nominal level $1 - \tilde\alpha$ so that the coverage
probability of $C_1(X)$ is closer to $1 - \alpha$ than $\pi_n$. Bootstrap calibrating
can be used iteratively as follows. Estimate the true coverage probability
of $C_1(X)$; if the difference between $1 - \alpha$ and the estimated coverage probability
of $C_1(X)$ is still large, we can adjust the nominal level again and
construct a new calibrated confidence set $C_2(X)$.


The key for bootstrap calibrating is how to determine the new nominal
level $1 - \tilde\alpha$ in each step. We now discuss the method suggested by Loh
(1987, 1991) in the case where the initial confidence sets are obtained by
using the method in §7.3.1. Consider first the asymptotic lower confidence
bound $\underline{\theta}_N = \hat\theta_n - n^{-1/2}\hat\sigma_F z_{1-\alpha}$ considered in Example 7.26. The coverage
probability $\pi_n = P(\underline{\theta}_N \le \theta)$ can be estimated by the bootstrap estimator
(approximated by Monte Carlo if necessary)

$$\hat\pi_n = G_B(z_{1-\alpha}) = P_*(\sqrt{n}(\hat\theta_n^* - \hat\theta_n)/\hat\sigma_F^* \le z_{1-\alpha}).$$


When the bootstrap distribution admits the Edgeworth expansion (1.106) 
with m = 3, we have 


es (21-0) Fn) qo(Z1-a: Fn) ! 1 
DRTC agg ag, = |e) Os ae 


Let $h$ be any increasing, unbounded, and twice differentiable function on
the interval $(0,1)$ and

$$\tilde\alpha = 1 - h^{-1}(h(1 - \alpha) - \hat\delta),$$

where

$$\hat\delta = h(\hat\pi_n) - h(1 - \alpha)$$


d(Z1-a, Fn) , (21-0; Fn) | «, " Mel noes 
ae), we Je (z1-a)h'(1— 0) 
4 obra Fy)? rma) FJ? (ial (1 — a) +0, (ss) (7.52) 


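The bootstrap calibration lower confidence bound is

$$\underline{\theta}_{CLB} = \hat\theta_n - n^{-1/2}\hat\sigma_F z_{1-\tilde\alpha}.$$

By (7.52),
$$1 - \tilde\alpha = 1 - \alpha + \frac{q_1(z_{1-\alpha}, F_n)\Phi'(z_{1-\alpha})}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big) \tag{7.53}$$
and
$$z_{1-\tilde\alpha} = z_{1-\alpha} + \frac{q_1(z_{1-\alpha}, F_n)}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big) \tag{7.54}$$

(exercise). Thus,

$$\underline{\theta}_{CLB} = \hat\theta_n - \frac{\hat\sigma_F}{\sqrt{n}}\Big[z_{1-\alpha} + \frac{q_1(z_{1-\alpha}, F_n)}{\sqrt{n}} + O_p\Big(\frac{1}{n}\Big)\Big]. \tag{7.55}$$

Comparing (7.55) with (7.39), we find that

$$\underline{\theta}_{CLB} - \underline{\theta}_{BT} = O_p(n^{-3/2}).$$

Thus, $\underline{\theta}_{CLB}$ is second-order accurate.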
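
The adjustment of the nominal level can be sketched once a particular $h$ is fixed. The code below takes $h$ to be the logit function, which is increasing, unbounded, and twice differentiable on $(0,1)$; this choice is an assumption made here for illustration and is not claimed to be the choice in Loh (1987, 1991). The estimated coverage `pi_hat` is assumed to lie strictly between 0 and 1 and to be supplied by the caller.

```python
import math
from statistics import NormalDist


def logit(p):
    return math.log(p / (1 - p))


def inverse_logit(y):
    return 1 / (1 + math.exp(-y))


def calibrated_alpha(pi_hat, alpha):
    """alpha_tilde = 1 - h^{-1}(h(1 - alpha) - delta_hat) with h = logit."""
    delta_hat = logit(pi_hat) - logit(1 - alpha)
    return 1 - inverse_logit(logit(1 - alpha) - delta_hat)


def calibrated_lower_bound(theta_hat, sigma_hat, n, pi_hat, alpha):
    """theta_hat - n^{-1/2} * sigma_hat * z_{1 - alpha_tilde}."""
    z = NormalDist().inv_cdf(1 - calibrated_alpha(pi_hat, alpha))
    return theta_hat - sigma_hat * z / math.sqrt(n)
```

If the estimated coverage equals the nominal level, the level is left unchanged; if the estimate falls short of $1 - \alpha$, the new $\tilde\alpha$ is smaller than $\alpha$ and the lower bound moves down.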


We can take $[\underline{\theta}_{CLB}, \overline{\theta}_{CLB}]$ as a two-sided confidence interval; it is still
second-order accurate. By calibrating directly the equal-tail two-sided confidence
interval

$$[\underline{\theta}_N, \overline{\theta}_N] = [\hat\theta_n - n^{-1/2}\hat\sigma_F z_{1-\alpha},\ \hat\theta_n + n^{-1/2}\hat\sigma_F z_{1-\alpha}], \tag{7.56}$$

we can obtain a higher order accurate confidence interval. Let $\hat\pi_n$ be the
bootstrap estimator of the coverage probability $P(\underline{\theta}_N \le \theta \le \overline{\theta}_N)$, $\hat\delta =
h(\hat\pi_n) - h(1 - 2\alpha)$, and $\tilde\alpha = [1 - h^{-1}(h(1 - 2\alpha) - \hat\delta)]/2$. Then the two-sided
bootstrap calibration confidence interval is the interval given by (7.56) with
$\alpha$ replaced by $\tilde\alpha$. Loh (1991) showed that this confidence interval is fourth-order
accurate. The length of this interval exceeds the length of the interval
in (7.56) by an amount of order $O_p(n^{-3/2})$.




7.5 Simultaneous Confidence Intervals 


So far we have studied confidence sets for a real-valued $\theta$ or a vector-valued
$\theta$ with a finite dimension $k$. In some applications, we need a confidence set
for real-valued $\theta_t$ with $t \in \mathcal{T}$, where $\mathcal{T}$ is an index set that may contain
infinitely many elements, for example, $\mathcal{T} = [0,1]$ or $\mathcal{T} = \mathcal{R}$.


Definition 7.6. Let $X$ be a sample from $P \in \mathcal{P}$, let $\theta_t$, $t \in \mathcal{T}$, be real-valued
parameters related to $P$, and let $C_t(X)$, $t \in \mathcal{T}$, be a class of (one-sided
or two-sided) confidence intervals.

(i) Intervals $C_t(X)$, $t \in \mathcal{T}$, are level $1 - \alpha$ simultaneous confidence intervals
for $\theta_t$, $t \in \mathcal{T}$, if and only if

$$\inf_{P \in \mathcal{P}} P(\theta_t \in C_t(X) \text{ for all } t \in \mathcal{T}) \ge 1 - \alpha. \tag{7.57}$$

The left-hand side of (7.57) is the confidence coefficient of $C_t(X)$, $t \in \mathcal{T}$.
(ii) Intervals $C_t(X)$, $t \in \mathcal{T}$, are simultaneous confidence intervals for $\theta_t$,
$t \in \mathcal{T}$, with asymptotic significance level $1 - \alpha$ if and only if

$$\lim_{n \to \infty} P(\theta_t \in C_t(X) \text{ for all } t \in \mathcal{T}) \ge 1 - \alpha. \tag{7.58}$$

Intervals $C_t(X)$, $t \in \mathcal{T}$, are $1 - \alpha$ asymptotically correct if and only if the
equality in (7.58) holds.


If the index set $\mathcal{T}$ contains $k < \infty$ elements, then $\theta = (\theta_t, t \in \mathcal{T})$ is a
$k$-vector and the methods studied in the previous sections can be applied
to construct a level $1 - \alpha$ confidence set $C(X)$ for $\theta$. If $C(X)$ can be
expressed as $\prod_{t \in \mathcal{T}} C_t(X)$ for some intervals $C_t(X)$, then $C_t(X)$, $t \in \mathcal{T}$, are
level $1 - \alpha$ simultaneous confidence intervals. This simple method, however,
does not always work. In this section, we introduce some other commonly
used methods for constructing simultaneous confidence intervals.


7.5.1 Bonferroni’s method 


Bonferroni's method, which works when $\mathcal{T}$ contains $k < \infty$ elements, is
based on the following simple inequality for $k$ events $A_1, ..., A_k$:

$$P\Big(\bigcup_{i=1}^{k} A_i\Big) \le \sum_{i=1}^{k} P(A_i) \tag{7.59}$$

(see Proposition 1.1). For each $t \in \mathcal{T}$, let $C_t(X)$ be a level $1 - \alpha_t$ confidence
interval for $\theta_t$. If $\alpha_t$'s are chosen so that $\sum_{t \in \mathcal{T}} \alpha_t = \alpha$ (e.g., $\alpha_t = \alpha/k$
for all $t$), then Bonferroni's simultaneous confidence intervals are $C_t(X)$,
$t \in \mathcal{T}$. It can be shown (exercise) that Bonferroni's intervals are of level
$1 - \alpha$, but they are not of confidence coefficient $1 - \alpha$ even if $C_t(X)$ has
confidence coefficient $1 - \alpha_t$ for any fixed $t$. Note that Bonferroni's method
does not require that $C_t(X)$, $t \in \mathcal{T}$, be independent.
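
The equal split $\alpha_t = \alpha/k$ can be sketched for $k$ population means. The code below uses two-sided normal-approximation intervals for each mean, so the individual levels are only asymptotic; this choice, like the function name, is an assumption made for the illustration and differs from the exact t-based intervals used in the example that follows.

```python
import math
from statistics import NormalDist


def bonferroni_intervals(samples, alpha):
    """Two-sided normal-approximation intervals for k means, each at level 1 - alpha/k."""
    k = len(samples)
    alpha_t = alpha / k  # equal split, so the alpha_t's sum to alpha
    z = NormalDist().inv_cdf(1 - alpha_t / 2)
    intervals = []
    for x in samples:
        n = len(x)
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        half_width = z * s / math.sqrt(n)
        intervals.append((xbar - half_width, xbar + half_width))
    return intervals
```

Each interval widens as $k$ grows, which is the price paid for the simultaneous guarantee; no independence among the samples is used anywhere in the computation.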


Example 7.27 (Multiple comparison in one-way ANOVA models). Consider
the one-way ANOVA model in Example 6.18. If the hypothesis $H_0$
in (6.53) is rejected, one typically would like to compare $\mu_i$'s. One way to
compare $\mu_i$'s is to consider simultaneous confidence intervals for $\mu_i - \mu_j$,
$1 \le i < j \le m$. Since $X_{ij}$'s are independently normal, the sample means
$\bar X_{i\cdot}$ are independently normal $N(\mu_i, \sigma^2/n_i)$, $i = 1, ..., m$, respectively, and
they are independent of $SSR = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(X_{ij} - \bar X_{i\cdot})^2$. Consequently,
$(\bar X_{i\cdot} - \bar X_{j\cdot} - \mu_i + \mu_j)/\sqrt{v_{ij}}$ has the t-distribution $t_{n-m}$, $1 \le i < j \le m$, where
$v_{ij} = (n_i^{-1} + n_j^{-1})SSR/(n - m)$. For each $(i,j)$, a confidence interval for
$\mu_i - \mu_j$ with confidence coefficient $1 - \alpha$ is

$$C_{ij,\alpha}(X) = [\bar X_{i\cdot} - \bar X_{j\cdot} - t_{n-m,\alpha/2}\sqrt{v_{ij}},\ \bar X_{i\cdot} - \bar X_{j\cdot} + t_{n-m,\alpha/2}\sqrt{v_{ij}}], \tag{7.60}$$

where $t_{n-m,\alpha}$ is the $(1 - \alpha)$th quantile of the t-distribution $t_{n-m}$. One can
show that $C_{ij,\alpha}(X)$ is actually UMAU (exercise). Bonferroni's level $1 - \alpha$
simultaneous confidence intervals for $\mu_i - \mu_j$, $1 \le i < j \le m$, are $C_{ij,\alpha_*}(X)$,
$1 \le i < j \le m$, where $\alpha_* = 2\alpha/[m(m - 1)]$. When $m$ is large, these
confidence intervals are very conservative in the sense that the confidence
coefficient of these intervals may be much larger than the nominal level
$1 - \alpha$ and these intervals may be too wide to be useful.


If the normality assumption is removed, then $C_{ij,\alpha}(X)$ is $1 - \alpha$ asymptotically
correct as $\min\{n_1, ..., n_m\} \to \infty$ and $\max\{n_1, ..., n_m\}/\min\{n_1, ..., n_m\}
\to c < \infty$. Therefore, $C_{ij,\alpha_*}(X)$, $1 \le i < j \le m$, are simultaneous confidence
intervals with asymptotic significance level $1 - \alpha$.


One can establish similar results for the two-way balanced ANOVA models
in Example 6.19 (exercise). ∎


7.5.2 Scheffé’s method in linear models 


Since multiple comparison in ANOVA models (or, more generally, linear
models) is one of the most important applications of simultaneous confidence
intervals, we now introduce Scheffé's method for problems in linear
models. Consider the normal linear model

$$X = N_n(Z\beta, \sigma^2 I_n), \tag{7.61}$$

where $\beta$ is a $p$-vector of unknown parameters, $\sigma^2 > 0$ is unknown, and $Z$
is an $n \times p$ known matrix of rank $r \le p$. Let $L$ be an $s \times p$ matrix of
rank $s \le r$. Suppose that $\mathcal{R}(L) \subset \mathcal{R}(Z)$ and we would like to construct
simultaneous confidence intervals for $t^\tau L\beta$, where $t \in \mathcal{T} = \mathcal{R}^s - \{0\}$.


Let $\hat\beta$ be the LSE of $\beta$. Using the argument in Example 7.15, for each $t \in
\mathcal{T}$, we can obtain the following confidence interval for $t^\tau L\beta$ with confidence
coefficient $1 - \alpha$:

$$[t^\tau L\hat\beta - t_{n-r,\alpha/2}\hat\sigma\sqrt{t^\tau Dt},\ t^\tau L\hat\beta + t_{n-r,\alpha/2}\hat\sigma\sqrt{t^\tau Dt}],$$

where $\hat\sigma^2 = \|X - Z\hat\beta\|^2/(n - r)$, $D = L(Z^\tau Z)^- L^\tau$, and $t_{n-r,\alpha}$ is the $(1 - \alpha)$th
quantile of the t-distribution $t_{n-r}$. However, these intervals are not level
$1 - \alpha$ simultaneous confidence intervals for $t^\tau L\beta$, $t \in \mathcal{T}$.


Scheffé's (1959) method of constructing simultaneous confidence intervals
for $t^\tau L\beta$ is based on the following equality (exercise):

$$x^\tau A^{-1} x = \max_{y \in \mathcal{R}^k,\, y \ne 0} \frac{(y^\tau x)^2}{y^\tau A y}, \tag{7.62}$$

where $x \in \mathcal{R}^k$ and $A$ is a $k \times k$ positive definite matrix.
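
Equality (7.62) can be checked numerically in a small case. The sketch below treats only $k = 2$ with a hand-picked positive definite matrix; the function names and the numbers are illustrative assumptions. The maximum is attained at $y = A^{-1}x$, which the code also returns.

```python
import math


def quad_form_inverse(a, x):
    """Return x' A^{-1} x and the vector A^{-1} x for a 2x2 positive definite a."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    u = (a[1][1] * x[0] - a[0][1] * x[1]) / det    # first entry of A^{-1} x
    v = (-a[1][0] * x[0] + a[0][0] * x[1]) / det   # second entry of A^{-1} x
    return x[0] * u + x[1] * v, (u, v)


def ratio(a, x, y):
    """(y' x)^2 / (y' A y), the quantity maximized in (7.62)."""
    yx = y[0] * x[0] + y[1] * x[1]
    yay = (a[0][0] * y[0] ** 2 + (a[0][1] + a[1][0]) * y[0] * y[1]
           + a[1][1] * y[1] ** 2)
    return yx * yx / yay


A = [[2.0, 1.0], [1.0, 3.0]]
x = (1.0, -2.0)
value, y_star = quad_form_inverse(A, x)
# Scan unit vectors y; none should beat the ratio attained at y_star = A^{-1} x.
grid_max = max(
    ratio(A, x, (math.cos(k * math.pi / 180), math.sin(k * math.pi / 180)))
    for k in range(180)
)
```

Because the ratio is unchanged when $y$ is rescaled, scanning directions on a half circle is enough to approximate the maximum over all nonzero $y$.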


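Theorem 7.10. Assume normal linear model (7.61). Let $L$ be an $s \times p$
matrix of rank $s \le r$. Assume that $\mathcal{R}(L) \subset \mathcal{R}(Z)$ and $D = L(Z^\tau Z)^- L^\tau$ is
of full rank. Then

$$C_t(X) = [t^\tau L\hat\beta - \hat\sigma\sqrt{s c_\alpha t^\tau Dt},\ t^\tau L\hat\beta + \hat\sigma\sqrt{s c_\alpha t^\tau Dt}], \quad t \in \mathcal{T},$$

are simultaneous confidence intervals for $t^\tau L\beta$, $t \in \mathcal{T}$, with confidence
coefficient $1 - \alpha$, where $\hat\sigma^2 = \|X - Z\hat\beta\|^2/(n - r)$, $\mathcal{T} = \mathcal{R}^s - \{0\}$, and $c_\alpha$ is
the $(1 - \alpha)$th quantile of the F-distribution $F_{s,n-r}$.

Proof. Note that $t^\tau L\beta \in C_t(X)$ for all $t \in \mathcal{T}$ is equivalent to

$$\frac{(L\hat\beta - L\beta)^\tau D^{-1}(L\hat\beta - L\beta)}{s\hat\sigma^2} = \max_{t \in \mathcal{T}} \frac{(t^\tau L\hat\beta - t^\tau L\beta)^2}{s\hat\sigma^2\, t^\tau Dt} \le c_\alpha. \tag{7.63}$$
Then the result follows from the fact that the quantity on the left-hand
side of (7.63) has the F-distribution $F_{s,n-r}$. ∎


If the normality assumption is removed but conditions in Theorem 3.12
are assumed, then Scheffé's intervals in Theorem 7.10 are $1 - \alpha$ asymptotically
correct (exercise).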

The choice of the matrix L depends on the purpose of the analysis. One 
particular choice is L = Z, in which case t7’ LG is the mean of t7X. When 
Z is of full rank, we can choose L = Ip, in which case {t7 LG: t € T} is the 
class of all linear functions of G. Another LE commonly used when Z is of 


522 7. Confidence Sets 


full rank is the following (p — 1) x p matrix: 


dy 10° Oc has Oe Sa 
[ie a Ae a (7.64) 
0°90) 800 was ay ad 


It can be shown (exercise) that when $L$ is given by (7.64),

$$\{t^\tau L\beta : t \in \mathcal{R}^{p-1} - \{0\}\} = \{c^\tau\beta : c \in \mathcal{R}^p - \{0\},\ c^\tau J = 0\}, \qquad (7.65)$$

where $J$ is the $p$-vector of ones. Functions $c^\tau\beta$ satisfying $c^\tau J = 0$ are called contrasts. Therefore, setting simultaneous confidence intervals for $t^\tau L\beta$, $t \in T$, with $L$ given by (7.64) is the same as setting simultaneous confidence intervals for all nonzero contrasts.
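To make the correspondence in (7.65) concrete, the sketch below builds the matrix of (7.64) and maps a vector $t$ to the coefficient vector $c = L^\tau t$, whose components sum to zero. The function names and the sample vector are illustrative assumptions only.

```python
def scheffe_contrast_matrix(p):
    """Return the (p-1) x p matrix L of (7.64): identity block, last column -1."""
    return [[1 if j == i else (-1 if j == p - 1 else 0) for j in range(p)]
            for i in range(p - 1)]

def contrast_from_t(t):
    """Return c = L^T t, so that t^T L beta = c^T beta."""
    p = len(t) + 1
    L = scheffe_contrast_matrix(p)
    return [sum(t[i] * L[i][j] for i in range(p - 1)) for j in range(p)]

# Hypothetical t in R^3 - {0}; the resulting c is a contrast in R^4.
c = contrast_from_t([2.0, -0.5, 1.0])
print(c, sum(c))
```

The last component of `c` is minus the sum of the components of `t`, which is exactly what forces $c^\tau J = 0$.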


Although Scheffé's intervals have confidence coefficient $1-\alpha$, they are too conservative if we are only interested in $t^\tau L\beta$ for $t$ in a subset of $T$. In a one-way ANOVA model (Example 7.27), for instance, multiple comparison can be carried out using Scheffé's intervals with $\beta = (\mu_1, ..., \mu_m)$, $L$ given by (7.64), and $t \in T_0$ that contains exactly $m(m-1)/2$ vectors (Exercise 110). The resulting Scheffé's intervals are (Exercise 110)

$$\left[\, \bar X_{i\cdot} - \bar X_{j\cdot} - \sqrt{s c_\alpha v_{ij}},\ \ \bar X_{i\cdot} - \bar X_{j\cdot} + \sqrt{s c_\alpha v_{ij}}\,\right], \qquad t \in T_0, \qquad (7.66)$$

where $\bar X_{i\cdot}$ and $v_{ij}$ are given in (7.60). Since $T_0$ contains a much smaller number of elements than $T$, the simultaneous confidence intervals in (7.66) are very conservative. In fact, they are often more conservative than Bonferroni's intervals derived in Example 7.27 (see Example 7.29). In the following example, however, Scheffé's intervals have confidence coefficient $1-\alpha$, although we consider $t \in T_0 \subset T$.


Example 7.28 (Simple linear regression). Consider the special case of model (7.61) where

$$X_i = N(\beta_0 + \beta_1 z_i, \sigma^2), \qquad i = 1, ..., n,$$

and $z_i \in \mathcal{R}$ satisfying $S_z = \sum_{i=1}^n (z_i - \bar z)^2 > 0$, $\bar z = n^{-1}\sum_{i=1}^n z_i$. In this case, we are usually interested in simultaneous confidence intervals for the regression function $\beta_0 + \beta_1 z$, $z \in \mathcal{R}$. Note that the result in Theorem 7.10 (with $L = I_2$) can be applied to obtain simultaneous confidence intervals for $\beta_0 y + \beta_1 z$, $t \in T = \mathcal{R}^2 - \{0\}$, where $t = (y, z)$. If we let $y \equiv 1$, Scheffé's intervals in Theorem 7.10 are

$$\left[\, \hat\beta_0 + \hat\beta_1 z - \hat\sigma\sqrt{2 c_\alpha D(z)},\ \ \hat\beta_0 + \hat\beta_1 z + \hat\sigma\sqrt{2 c_\alpha D(z)}\,\right], \qquad z \in \mathcal{R} \qquad (7.67)$$

(exercise), where $D(z) = n^{-1} + (z - \bar z)^2/S_z$. Unless

$$\max_{z \in \mathcal{R}} \frac{(\hat\beta_0 + \hat\beta_1 z - \beta_0 - \beta_1 z)^2}{D(z)} = \max_{t \in T} \frac{(\hat\beta_0 y + \hat\beta_1 z - \beta_0 y - \beta_1 z)^2}{t^\tau (Z^\tau Z)^{-1} t} \qquad (7.68)$$

holds with probability 1, where $Z$ is the $n \times 2$ matrix whose $i$th row is the vector $(1, z_i)$, the confidence coefficient of the intervals in (7.67) is larger than $1-\alpha$. We now show that (7.68) actually holds with probability 1 so that the intervals in (7.67) have confidence coefficient $1-\alpha$. First,

$$P\big(n(\hat\beta_0 - \beta_0) + n(\hat\beta_1 - \beta_1)\bar z \neq 0\big) = 1.$$

Second, it can be shown (exercise) that the maximum on the right-hand side of (7.68) is achieved at

$$t = \begin{pmatrix} y \\ z \end{pmatrix} = \frac{Z^\tau Z}{n(\hat\beta_0 - \beta_0) + n(\hat\beta_1 - \beta_1)\bar z} \begin{pmatrix} \hat\beta_0 - \beta_0 \\ \hat\beta_1 - \beta_1 \end{pmatrix}. \qquad (7.69)$$

Finally, (7.68) holds since $y$ in (7.69) is equal to 1 (exercise).
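The band (7.67) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data below are made up, and the critical value `c_alpha` (the 0.95 quantile of $F_{2,4}$, about 6.94 from standard F tables) is passed in by hand because the Python standard library has no F quantile function.

```python
import math

def scheffe_regression_band(z_obs, x_obs, z, c_alpha):
    """Return the interval (7.67) at the point z for simple linear regression.

    c_alpha is assumed to be the (1 - alpha)th quantile of F_{2, n-2}.
    """
    n = len(z_obs)
    z_bar = sum(z_obs) / n
    x_bar = sum(x_obs) / n
    s_z = sum((zi - z_bar) ** 2 for zi in z_obs)
    # Least squares estimates of the slope and intercept.
    beta1 = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z_obs, x_obs)) / s_z
    beta0 = x_bar - beta1 * z_bar
    # sigma_hat^2 = ||X - Z beta_hat||^2 / (n - r) with r = 2.
    rss = sum((xi - beta0 - beta1 * zi) ** 2 for zi, xi in zip(z_obs, x_obs))
    sigma_hat = math.sqrt(rss / (n - 2))
    d_z = 1.0 / n + (z - z_bar) ** 2 / s_z  # D(z) in (7.67)
    half = sigma_hat * math.sqrt(2.0 * c_alpha * d_z)
    fit = beta0 + beta1 * z
    return fit - half, fit + half

# Hypothetical data with n = 6.
z_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_obs = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
for z in (3.5, 10.0):
    print(z, scheffe_regression_band(z_obs, x_obs, z, c_alpha=6.94))
```

Because $D(z)$ is smallest at $z = \bar z$, the band is narrowest at the mean of the design points and widens as $z$ moves away from it.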


7.5.3 Tukey’s method in one-way ANOVA models 


Consider the one-way ANOVA model in Example 6.18 (and Example 7.27). Note that both Bonferroni's and Scheffé's simultaneous confidence intervals for $\mu_i - \mu_j$, $1 \le i < j \le m$, are not of confidence coefficient $1-\alpha$ and are often too conservative. Tukey's method introduced next produces simultaneous confidence intervals for all nonzero contrasts (including the differences $\mu_i - \mu_j$, $1 \le i < j \le m$) with confidence coefficient $1-\alpha$.

Let $\hat\sigma^2 = SSR/(n-m)$, where $SSR$ is given in Example 7.27. The studentized range is defined to be

$$R_{st} = \max_{1 \le i < j \le m} \frac{|(\bar X_{i\cdot} - \mu_i) - (\bar X_{j\cdot} - \mu_j)|}{\hat\sigma}. \qquad (7.70)$$

Note that the distribution of $R_{st}$ does not depend on any unknown parameter (exercise).


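Theorem 7.11. Assume the one-way ANOVA model in Example 6.18. Let $q_\alpha$ be the $(1-\alpha)$th quantile of $R_{st}$ in (7.70). Then Tukey's intervals

$$\left[\, c^\tau\hat\beta - q_\alpha\hat\sigma c_+,\ \ c^\tau\hat\beta + q_\alpha\hat\sigma c_+\,\right], \qquad c \in \mathcal{R}^m - \{0\},\ c^\tau J = 0,$$

are simultaneous confidence intervals for $c^\tau\beta$, $c \in \mathcal{R}^m - \{0\}$, $c^\tau J = 0$, with confidence coefficient $1-\alpha$, where $c_+$ is the sum of all positive components of $c$, $\beta = (\mu_1, ..., \mu_m)$, $\hat\beta = (\bar X_{1\cdot}, ..., \bar X_{m\cdot})$, and $J$ is the $m$-vector of ones.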


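Proof. Let $Y_i = (\bar X_{i\cdot} - \mu_i)/\hat\sigma$ and $Y = (Y_1, ..., Y_m)$. Then the result follows if we can show that

$$\max_{1 \le i < j \le m} |Y_i - Y_j| \le q_\alpha \qquad (7.71)$$

is equivalent to

$$|c^\tau Y| \le q_\alpha c_+ \quad \text{for all } c \in \mathcal{R}^m \text{ satisfying } c^\tau J = 0,\ c \neq 0. \qquad (7.72)$$

Let $c(i,j) = (c_1, ..., c_m)$ with $c_i = 1$, $c_j = -1$, and $c_l = 0$ for $l \neq i$ or $j$. Then $c(i,j)_+ = 1$ and $|[c(i,j)]^\tau Y| = |Y_i - Y_j|$ and, therefore, (7.72) implies (7.71). Let $c = (c_1, ..., c_m)$ be a vector satisfying the conditions in (7.72). Define $-c_-$ to be the sum of negative components of $c$. Then

$$\begin{aligned}
|c^\tau Y| &= \frac{1}{c_+}\Big|\, c_+ \sum_{j: c_j < 0} c_j Y_j + c_- \sum_{i: c_i > 0} c_i Y_i \Big| \\
&= \frac{1}{c_+}\Big|\, \sum_{i: c_i > 0}\sum_{j: c_j < 0} c_i c_j Y_j - \sum_{j: c_j < 0}\sum_{i: c_i > 0} c_i c_j Y_i \Big| \\
&= \frac{1}{c_+}\Big|\, \sum_{i: c_i > 0}\sum_{j: c_j < 0} c_i c_j (Y_j - Y_i) \Big| \\
&\le \frac{1}{c_+} \sum_{i: c_i > 0}\sum_{j: c_j < 0} |c_i||c_j||Y_j - Y_i| \\
&\le \max_{1 \le i < j \le m} |Y_i - Y_j| \ \frac{1}{c_+} \sum_{i: c_i > 0}\sum_{j: c_j < 0} |c_i||c_j| \\
&= \max_{1 \le i < j \le m} |Y_i - Y_j| \ c_+,
\end{aligned}$$

where the first and the last equalities follow from the fact that $c_- = c_+ \neq 0$. Hence (7.71) implies (7.72).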


Tukey's method works well when $n_i$'s are all equal to $n_0$, in which case values of $\sqrt{n_0}\,q_\alpha$ can be found using tables or statistical software. When $n_i$'s are unequal, some modifications are suggested; see Tukey (1977) and Milliken and Johnson (1992).


Example 7.29. We compare the t-type confidence intervals in (7.60), Bonferroni's, Scheffé's, and Tukey's simultaneous confidence intervals for $\mu_i - \mu_j$, $1 \le i < j \le 3$, based on the following data $X_{ij}$ given in Mendenhall and Sincich (1995):


393 520 236 1384 55 166 415 153 
433 94 535 327 214 135 280 304 




In this example, $m = 3$, $n_i \equiv n_0 = 10$, $\bar X_{1\cdot} = 229.6$, $\bar X_{2\cdot} = 309.8$, $\bar X_{3\cdot} = 427.8$, and $\hat\sigma = 168.95$. Let $\alpha = 0.05$. For the t-type intervals in (7.60), $t_{27,0.975} = 2.05$. For Bonferroni's method, $\alpha_t = \alpha/3 = 0.017$ and $t_{27,0.983} = 2.55$. For Scheffé's method, $c_{0.95} = 3.35$ and $\sqrt{2c_{0.95}} = 2.59$. From Table 13 in Mendenhall and Sincich (1995, Appendix II), $\sqrt{n_0}\,q_{0.05} = 3.49$. The resulting confidence intervals are given as follows.


                               Parameter
Method       | $\mu_1 - \mu_2$   | $\mu_1 - \mu_3$   | $\mu_2 - \mu_3$   | Length
t-type       | [-235.2, 74.6]    | [-353.1, -43.3]   | [-272.8, 37.0]    | 309.8
Bonferroni   | [-273.0, 112.4]   | [-390.9, -5.5]    | [-310.6, 74.8]    | 385.4
Scheffé      | [-276.0, 115.4]   | [-393.9, -2.5]    | [-313.6, 77.8]    | 391.4
Tukey        | [-267.3, 106.7]   | [-385.2, -11.2]   | [-304.9, 69.1]    | 374.0


Apparently, t-type intervals have the shortest length, but they are not simultaneous confidence intervals. Tukey's intervals in this example have the shortest length among simultaneous confidence intervals. Scheffé's intervals have the longest length.
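The interval lengths in Example 7.29 can be recomputed from the summary statistics and critical values quoted in the example. The sketch below assumes, consistently with the quoted numbers, that each of the first three methods uses half-width (critical value) times $\hat\sigma\sqrt{2/n_0}$ and that Tukey's half-width for a pairwise difference is $q_\alpha\hat\sigma$ with $c_+ = 1$; the dictionary layout and function names are illustrative only.

```python
import math
from itertools import combinations

# Summary statistics and critical values quoted in Example 7.29.
means = {1: 229.6, 2: 309.8, 3: 427.8}
sigma_hat, n0 = 168.95, 10
crit = {"t-type": 2.05, "Bonferroni": 2.55, "Scheffe": 2.59}
tukey_sqrt_n0_q = 3.49  # tabled value of sqrt(n0) * q_alpha

def half_widths():
    """Half-width of each method's interval for a pairwise difference."""
    se_diff = sigma_hat * math.sqrt(2.0 / n0)
    hw = {name: value * se_diff for name, value in crit.items()}
    hw["Tukey"] = tukey_sqrt_n0_q / math.sqrt(n0) * sigma_hat
    return hw

def intervals():
    """Intervals for mu_i - mu_j, i < j, keyed by method and pair (i, j)."""
    out = {}
    for name, hw in half_widths().items():
        out[name] = {
            (i, j): (means[i] - means[j] - hw, means[i] - means[j] + hw)
            for i, j in combinations(sorted(means), 2)
        }
    return out

for name, hw in half_widths().items():
    print(f"{name:10s} length = {2 * hw:.1f}")
```

With the rounded critical values, the computed lengths agree with the table to within rounding for the first three methods; the Tukey length comes out near 372.9 rather than 374.0, presumably because the tabled value 3.49 is itself rounded or refers to a nearby degrees-of-freedom entry.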


7.5.4 Confidence bands for c.d.f.’s 


Let $X_1, ..., X_n$ be i.i.d. from a continuous c.d.f. $F$ on $\mathcal{R}$. Consider the problem of setting simultaneous confidence intervals for $F(t)$, $t \in \mathcal{R}$. A class of simultaneous confidence intervals indexed by $t \in \mathcal{R}$ is called a confidence band. For example, the class of intervals in (7.67) is a confidence band with confidence coefficient $1-\alpha$.


First, consider the case where $F$ is in a parametric family, i.e., $F = F_\theta$, $\theta \in \Theta \subset \mathcal{R}^k$. If $\theta$ is real-valued and $F_\theta(t)$ is nonincreasing in $\theta$ for every $t$ (e.g., when the parametric family has monotone likelihood ratio; see Lemma 6.3) and if $[\underline\theta, \overline\theta]$ is a confidence interval for $\theta$ with confidence coefficient (or significance level) $1-\alpha$, then

$$[F_{\overline\theta}(t),\ F_{\underline\theta}(t)], \qquad t \in \mathcal{R},$$

are simultaneous confidence intervals for $F(t)$, $t \in \mathcal{R}$, with confidence coefficient (or significance level) $1-\alpha$. One-sided simultaneous confidence intervals can be similarly obtained.


When $F = F_\theta$ with a multivariate $\theta$, there is no simple and general way of constructing a confidence band for $F(t)$, $t \in \mathcal{R}$. We consider an example.


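Example 7.30. Let $X_1, ..., X_n$ be i.i.d. from $N(\mu, \sigma^2)$. Note that $F(t) = \Phi\left(\frac{t-\mu}{\sigma}\right)$. If $\mu$ is unknown and $\sigma^2$ is known, then, from the results in Example 7.14, a confidence band for $F(t)$, $t \in \mathcal{R}$, with confidence coefficient $1-\alpha$ is

$$\left[\, \Phi\left(\frac{t - \bar X}{\sigma} - \frac{\Phi^{-1}(1-\alpha/2)}{\sqrt n}\right),\ \ \Phi\left(\frac{t - \bar X}{\sigma} + \frac{\Phi^{-1}(1-\alpha/2)}{\sqrt n}\right) \right], \qquad t \in \mathcal{R}.$$

A confidence band can be similarly obtained if $\sigma^2$ is unknown and $\mu$ is known.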
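For the known-variance case, the band follows from plugging the endpoints of the usual interval $\bar X \pm \Phi^{-1}(1-\alpha/2)\,\sigma/\sqrt n$ for $\mu$ into $\Phi((t-\mu)/\sigma)$, which is decreasing in $\mu$. The following sketch evaluates the band at a point $t$ using `statistics.NormalDist` from the Python standard library; the sample and the function name are illustrative assumptions.

```python
import math
from statistics import NormalDist, fmean

def normal_cdf_band(sample, sigma, t, alpha=0.05):
    """Band for F(t) = Phi((t - mu)/sigma) when sigma is known and mu is not."""
    std = NormalDist()
    n = len(sample)
    # Half-width of the interval for mu, expressed on the standardized scale.
    shift = std.inv_cdf(1 - alpha / 2) / math.sqrt(n)
    center = (t - fmean(sample)) / sigma
    return std.cdf(center - shift), std.cdf(center + shift)

# Hypothetical sample with mean 0 and known sigma = 1.
print(normal_cdf_band([-1.0, 0.0, 1.0], sigma=1.0, t=0.0))
```

The two endpoints at a given $t$ always bracket the plug-in value $\Phi((t - \bar X)/\sigma)$.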


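Suppose now that both $\mu$ and $\sigma^2$ are unknown. In Example 7.18, we discussed how to obtain a lower confidence bound $\underline\theta$ for $\theta = \mu/\sigma$. An upper confidence bound $\overline\theta$ for $\theta$ can be similarly obtained. Suppose that both $\underline\theta$ and $\overline\theta$ have confidence coefficient $1-\alpha/4$. Using inequality (7.59), we can obtain the following level $1-\alpha$ confidence band for $F(t)$, $t \in \mathcal{R}$:

$$\left[\, \Phi\left(\frac{a_{n,\alpha} t}{S} - \overline\theta\right),\ \ \Phi\left(\frac{b_{n,\alpha} t}{S} - \underline\theta\right) \right], \qquad t \in \mathcal{R},$$

where $a_{n,\alpha} = [\chi^2_{n-1,1-\alpha/4}/(n-1)]^{1/2}$, $b_{n,\alpha} = [\chi^2_{n-1,\alpha/4}/(n-1)]^{1/2}$, and $\chi^2_{n-1,\alpha}$ is the $(1-\alpha)$th quantile of the chi-square distribution $\chi^2_{n-1}$.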


Consider now the case where $F$ is in a nonparametric family. Let $D_n(F) = \sup_{t \in \mathcal{R}} |F_n(t) - F(t)|$, which is related to the Kolmogorov-Smirnov test statistics introduced in §6.5.2, where $F_n$ is the empirical c.d.f. given by (5.1). From Theorem 6.10(i), there exists a $c_\alpha$ such that

$$P\big(D_n(F) \le c_\alpha\big) = 1 - \alpha. \qquad (7.73)$$

Then a confidence band for $F(t)$, $t \in \mathcal{R}$, with confidence coefficient $1-\alpha$ is given by

$$[\, F_n(t) - c_\alpha,\ F_n(t) + c_\alpha \,], \qquad t \in \mathcal{R}. \qquad (7.74)$$


When $n$ is large, we may approximate $c_\alpha$ using the asymptotic result in Theorem 6.10(ii), i.e., we can replace (7.73) by

$$\sum_{j=1}^{\infty} (-1)^{j-1} e^{-2j^2 n c_\alpha^2} = \frac{\alpha}{2}. \qquad (7.75)$$

The resulting intervals in (7.74) have limiting confidence coefficient $1-\alpha$.
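Equation (7.75) has no closed-form solution, but $c_\alpha$ is easy to find numerically. The sketch below uses bisection on $t = \sqrt n\, c_\alpha$; the bracketing interval, the number of series terms, and the function names are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def ks_tail(t, terms=100):
    """Limiting value of P(sqrt(n) D_n(F) > t): 2 * sum (-1)^(j-1) exp(-2 j^2 t^2)."""
    return 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * t * t)
        for j in range(1, terms + 1)
    )

def asymptotic_c_alpha(n, alpha=0.05):
    """Solve (7.75) for c_alpha by bisection on t = sqrt(n) * c_alpha."""
    lo, hi = 0.3, 5.0  # ks_tail decreases from about 1 to about 0 on this range
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ks_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / math.sqrt(n)

print(asymptotic_c_alpha(100))
```

For $\alpha = 0.05$ the solution satisfies $\sqrt n\, c_\alpha \approx 1.358$, the familiar large-sample Kolmogorov-Smirnov critical value, so the band half-width shrinks at rate $n^{-1/2}$.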


Using $D_n^+(F) = \sup_{t \in \mathcal{R}} [F_n(t) - F(t)]$ and the results in Theorem 6.10, we can also obtain one-sided simultaneous confidence intervals for $F(t)$, $t \in \mathcal{R}$, with confidence coefficient $1-\alpha$ or limiting confidence coefficient $1-\alpha$.

When $n$ is small, it is possible that some intervals in (7.74) are not within the interval $[0, 1]$. This is undesirable since $F(t) \in [0, 1]$ for all $t$. One way to solve this problem is replacing $F_n(t) - c_\alpha$ and $F_n(t) + c_\alpha$ by, respectively, $\max\{F_n(t) - c_\alpha, 0\}$ and $\min\{F_n(t) + c_\alpha, 1\}$. But the resulting intervals have a confidence coefficient larger than $1-\alpha$. The limiting confidence coefficient of these intervals is still $1-\alpha$ (exercise).
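A truncated version of the band (7.74) can be sketched as follows. The critical value `c_alpha` is treated as a given input (for example, obtained from (7.73) or (7.75)); the sample, the value 0.3, and the function name are hypothetical.

```python
from bisect import bisect_right

def ecdf_band(sample, c_alpha, t):
    """Return (max{F_n(t) - c, 0}, min{F_n(t) + c, 1}) at the point t."""
    xs = sorted(sample)
    f_n = bisect_right(xs, t) / len(xs)  # empirical c.d.f. F_n(t)
    return max(f_n - c_alpha, 0.0), min(f_n + c_alpha, 1.0)

# Hypothetical sample and critical value.
sample = [3.0, 1.0, 2.0, 4.0]
for t in (0.0, 2.0, 10.0):
    print(t, ecdf_band(sample, 0.3, t))
```

Truncation only affects points where $F_n(t)$ is within $c_\alpha$ of 0 or 1, that is, the tails of the sample.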




7.6 Exercises 


1. Let $X_{i1}, ..., X_{in_i}$, $i = 1, 2$, be two independent samples i.i.d. from $N(\mu_i, \sigma_i^2)$, $i = 1, 2$, respectively, where all parameters are unknown. Let $\bar X_i$ and $S_i^2$ be the sample mean and sample variance of the $i$th sample, $i = 1, 2$.
(a) Let $\theta = \mu_1 - \mu_2$. Assume that $\sigma_1 = \sigma_2$. Show that

$$t(X, \theta) = \frac{(\bar X_1 - \bar X_2 - \theta)\big/\sqrt{n_1^{-1} + n_2^{-1}}}{\sqrt{[(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2]/(n_1 + n_2 - 2)}}$$

is a pivotal quantity and construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$, using $t(X, \theta)$.
(b) Let $\theta = \sigma_2^2/\sigma_1^2$. Show that $\Re(X, \theta) = S_2^2/(\theta S_1^2)$ is a pivotal quantity and construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$, using $\Re(X, \theta)$.

2. Let $X_i$, $i = 1, 2$, be independent with the p.d.f.'s $\lambda_i e^{-\lambda_i x} I_{(0,\infty)}(x)$, $i = 1, 2$, respectively.
(a) Let $\theta = \lambda_1/\lambda_2$. Show that $\theta X_1/X_2$ is a pivotal quantity and construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$, using this pivotal quantity.
(b) Let $\theta = (\lambda_1, \lambda_2)$. Show that $\lambda_1 X_1 + \lambda_2 X_2$ is a pivotal quantity and construct a confidence set for $\theta$ with confidence coefficient $1-\alpha$, using this pivotal quantity.

3. In Example 7.1,
(a) obtain a pivotal quantity when $\theta = (\mu, \sigma)$ and discuss how to use it to construct a confidence set for $\theta$ with confidence coefficient $1-\alpha$;
(b) obtain the confidence set in part (a) when $f$ is the p.d.f. of the exponential distribution $E(0, 1)$.

4. In Example 7.3, show that the equation $n[\bar Y(\theta)]^2 = t^2_{n-1,\alpha/2} S^2(\theta)$ defines a parabola in $\theta$ and discuss when $C(X)$ is a finite interval, the complement of a finite interval, or the whole real line.

5. Let $X$ be a sample from $P$ in a parametric family indexed by $\theta$. Suppose that $T(X)$ is a real-valued statistic with p.d.f. $f_\theta(t)$ and that $\Re(t, \theta)$ is a monotone function of $t$ for each $\theta$. Show that if

$$f_\theta(t) = g(\Re(t, \theta)) \left| \frac{\partial}{\partial t} \Re(t, \theta) \right|$$

for some function $g$, then $\Re(T(X), \theta)$ is a pivotal quantity.

6. Let $X_1, ..., X_n$ be i.i.d. from $N(\theta, \theta)$ with an unknown $\theta > 0$. Find a pivotal quantity and use it to construct a confidence interval for $\theta$.


7. Prove (7.3).

8. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(0, \theta)$ with an unknown $\theta > 0$.
(a) Using the pivotal quantity $\bar X/\theta$, construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$.
(b) Apply Theorem 7.1 with $T = \bar X$ to construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$.

9. Let $X_1, ..., X_n$ be i.i.d. random variables with the Lebesgue p.d.f. $\frac{a}{\theta}\left(\frac{x}{\theta}\right)^{a-1} I_{(0,\theta)}(x)$, where $a \ge 1$ is known and $\theta > 0$ is unknown.
(a) Apply Theorem 7.1 with $T = X_{(n)}$ to construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$. Compare the result with that in Example 7.2 when $a = 1$.
(b) Show that the confidence interval in (a) can also be obtained using a pivotal quantity.

10. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(a, 1)$ with an unknown $a$.
(a) Construct a confidence interval for $a$ with confidence coefficient $1-\alpha$ by using Theorem 7.1 with $T = X_{(1)}$.
(b) Show that the confidence interval in (a) can also be obtained using a pivotal quantity.

11. Let $X$ be a single observation from the uniform distribution $U(\theta - \frac12, \theta + \frac12)$, where $\theta \in \mathcal{R}$.
(a) Show that $X - \theta$ is a pivotal quantity and that a confidence interval of the form $[X + c, X + d]$ with some constants $-\frac12 < c < d < \frac12$ has confidence coefficient $1-\alpha$ if and only if its length is $1-\alpha$.
(b) Show that the c.d.f. $F_\theta(x)$ of $X$ is nonincreasing in $\theta$ for any $x$ and apply Theorem 7.1 to construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$.

12. Let $X_1, ..., X_n$ be i.i.d. from the Pareto distribution $Pa(a, \theta)$, $\theta > 0$, $a > 0$.
(a) When $\theta$ is known, derive a confidence interval for $a$ with confidence coefficient $1-\alpha$ by applying Theorem 7.1 with $T = X_{(1)}$, the smallest order statistic.
(b) When both $a$ and $\theta$ are unknown and $n > 2$, derive a confidence interval for $\theta$ with confidence coefficient $1-\alpha$ by applying Theorem 7.1 with $T = \prod_{i=1}^n (X_i/X_{(1)})$.
(c) Show that the confidence intervals in (a) and (b) can be obtained using pivotal quantities.
(d) When both $a$ and $\theta$ are unknown, construct a confidence set for $(a, \theta)$ with confidence coefficient $1-\alpha$ by using a pivotal quantity.


13. Let $X_1, ..., X_n$ be i.i.d. from the Weibull distribution $W(a, \theta)$, where $a > 0$ and $\theta > 0$ are unknown. Show that $\Re(X, a, \theta) = \prod_{i=1}^n (X_i^a/\theta)$ is pivotal. Construct a confidence set for $(a, \theta)$ with confidence coefficient $1-\alpha$ by using $\Re(X, a, \theta)$.

14. Consider Exercise 17 in §6.6. Construct a level $1-\alpha$ confidence interval for $\theta$ based on the observation $X$. Find a condition under which the derived confidence interval has confidence coefficient $1-\alpha$.

15. Prove (7.4).

16. Let $X_1, ..., X_n$ be i.i.d. binary random variables with $P(X_i = 1) = p$. Using Theorem 7.1 with $T = \sum_{i=1}^n X_i$, show that a level $1-\alpha$ confidence interval for $p$ is

$$\left[\, \frac{1}{1 + \frac{n-T+1}{T} F_{2(n-T+1),2T,\alpha_2}},\ \ \frac{\frac{T+1}{n-T} F_{2(T+1),2(n-T),\alpha_1}}{1 + \frac{T+1}{n-T} F_{2(T+1),2(n-T),\alpha_1}} \,\right],$$

where $\alpha_1 + \alpha_2 = \alpha$, $F_{a,b,\alpha}$ is the $(1-\alpha)$th quantile of the F-distribution $F_{a,b}$, and $F_{a,0,\alpha}$ is defined to be $\infty$. (Hint: show that $P(T \ge t) = P(Y \le p)$, where $Y$ has the beta distribution $B(t, n-t+1)$.)

17. Let $X$ be a sample of size 1 from the negative binomial distribution $NB(p, r)$ with a known $r$ and an unknown $p \in (0, 1)$. Using Theorem 7.1 with $T = X - r$, show that a level $1-\alpha$ confidence interval for $p$ is

$$\left[\, \frac{1}{1 + \frac{T+1}{r} F_{2(T+1),2r,\alpha_2}},\ \ \frac{\frac{r}{T} F_{2r,2T,\alpha_1}}{1 + \frac{r}{T} F_{2r,2T,\alpha_1}} \,\right],$$

where $\alpha_1 + \alpha_2 = \alpha$ and $F_{a,b,\alpha}$ is the same as that in the previous exercise.

18. Let $T$ be a statistic having the noncentral chi-square distribution $\chi_r^2(\theta)$ (see §1.3.1), where $\theta > 0$ is unknown and $r$ is a known positive integer. Show that the c.d.f. $F_{T,\theta}(t)$ of $T$ is nonincreasing in $\theta$ for each fixed $t$ and use this result to construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$.

19. Repeat the previous exercise when $\chi_r^2(\theta)$ is replaced by the noncentral F-distribution $F_{r_1,r_2}(\theta)$ (see §1.3.1) with unknown $\theta > 0$ and known positive integers $r_1$ and $r_2$.

20. Consider the one-way ANOVA model in Example 6.18. Let $\bar\mu = n^{-1}\sum_{i=1}^m n_i\mu_i$ and $\theta = \sigma^{-2}\sum_{i=1}^m n_i(\mu_i - \bar\mu)^2$. Construct an upper confidence bound for $\theta$ that has confidence coefficient $1-\alpha$ and is a function of $T = (n-m)(m-1)^{-1} SST/SSR$.


21. Prove Proposition 7.2 and provide a sufficient condition under which the test $T(X) = 1 - I_{A(\theta_0)}(X)$ has size $\alpha$.

22. In Example 7.7,
(a) show that $c(\theta)$ and $c_i(\theta)$'s are nondecreasing in $\theta$;
(b) show that $(\underline p, 1]$ with $\underline p$ given by (7.5) is a level $1-\alpha$ confidence interval for $p$;
(c) compare the interval in (b) with the interval obtained using the result in Exercise 16 with $\alpha_1 = 0$.

23. Show that the confidence intervals in Example 7.14 and Exercise 1 can also be obtained by inverting the acceptance regions of the tests for one-sample and two-sample problems in §6.2.3.

24. Let $X_i$, $i = 1, 2$, be independently distributed as the binomial distributions $Bi(p_i, n_i)$, $i = 1, 2$, respectively, where $n_i$'s are known and $p_i$'s are unknown. Show how to invert the acceptance regions of the UMPU tests in Example 6.11 to obtain a level $1-\alpha$ confidence interval for the odds ratio $\frac{p_2(1-p_1)}{p_1(1-p_2)}$.

25. Let $X_1, ..., X_n$ be i.i.d. from $N(\mu, \sigma^2)$.
(a) Suppose that $\sigma^2 = \gamma\mu^2$ with unknown $\gamma > 0$ and $\mu \in \mathcal{R}$. Obtain a confidence set for $\gamma$ with confidence coefficient $1-\alpha$ by inverting the acceptance regions of LR tests for $H_0: \gamma = \gamma_0$ versus $H_1: \gamma \neq \gamma_0$.
(b) Repeat (a) when $\sigma^2 = \gamma\mu$ with unknown $\gamma > 0$ and $\mu > 0$.

26. Consider the problem in Example 6.17. Discuss how to construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$ by
(a) inverting the acceptance regions of the tests derived in Example 6.17;
(b) applying Theorem 7.1.

27. Let $X_1, ..., X_n$ be i.i.d. from the uniform distribution $U(\theta - \frac12, \theta + \frac12)$, where $\theta \in \mathcal{R}$. Construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$.

28. Let $X_1, ..., X_n$ be i.i.d. binary random variables with $P(X_i = 1) = p$. Using the p.d.f. of the beta distribution $B(a, b)$ as the prior p.d.f., construct a level $1-\alpha$ HPD credible set for $p$.

29. Let $X_1, ..., X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with an unknown $\theta = (\mu, \sigma^2)$. Consider the prior Lebesgue p.d.f. $\pi(\theta) = \pi_1(\mu|\sigma^2)\pi_2(\sigma^2)$, where $\pi_1(\mu|\sigma^2)$ is the p.d.f. of $N(\mu_0, \sigma_0^2\sigma^2)$,

$$\pi_2(\sigma^2) = \frac{1}{\Gamma(a) b^a} \left(\frac{1}{\sigma^2}\right)^{a+1} e^{-1/(b\sigma^2)} I_{(0,\infty)}(\sigma^2),$$


and $\mu_0$, $\sigma_0^2$, $a$, and $b$ are known.
(a) Find the posterior of $\mu$ and construct a level $1-\alpha$ HPD credible set for $\mu$.
(b) Show that the credible set in (a) converges to the confidence interval obtained in Example 7.14 as $\sigma_0^2$, $a$, and $b$ converge to some limits.

30. Let $X_1, ..., X_n$ be i.i.d. with a Lebesgue p.d.f. $\frac{1}{\sigma} f\left(\frac{x-\mu}{\sigma}\right)$, where $f$ is a known p.d.f. and $\mu$ and $\sigma > 0$ are unknown. Let $X_0$ be a future observation that is independent of $X_i$'s and has the same distribution as $X_i$. Find a pivotal quantity $\Re(X, X_0)$ and construct a level $1-\alpha$ prediction set for $X_0$.

31. Let $X_1, ..., X_n$ be i.i.d. from a continuous c.d.f. $F$ on $\mathcal{R}$ and $X_0$ be a future observation that is independent of $X_i$'s and has the c.d.f. $F$. Suppose that $F$ is strictly increasing in a neighborhood of $F^{-1}(\alpha/2)$ and a neighborhood of $F^{-1}(1-\alpha/2)$. Let $F_n$ be the empirical c.d.f. defined by (5.1). Show that the prediction interval $C(X) = [F_n^{-1}(\alpha/2), F_n^{-1}(1-\alpha/2)]$ for $X_0$ satisfies $P(X_0 \in C(X)) \to 1-\alpha$, where $P$ is the joint distribution of $(X_0, X_1, ..., X_n)$.

32. Let $X_1, ..., X_n$ be i.i.d. with a Lebesgue p.d.f. $f(x - \mu)$, where $f$ is known and $\mu$ is unknown.
(a) If $f$ is the p.d.f. of the standard normal distribution, show that the confidence interval $[\bar X - c_1, \bar X + c_1]$ is better than $[X_1 - c_2, X_1 + c_2]$ in terms of their lengths, where $c_i$'s are chosen so that these confidence intervals have confidence coefficient $1-\alpha$.
(b) If $f$ is the p.d.f. of the Cauchy distribution $C(0, 1)$, show that the two confidence intervals in (a) have the same length.

33. Let $X_1, ..., X_n$ ($n > 1$) be i.i.d. from the exponential distribution $E(\theta, \theta)$, where $\theta > 0$ is unknown.
(a) Show that both $\bar X/\theta$ and $X_{(1)}/\theta$ are pivotal quantities, where $\bar X$ is the sample mean and $X_{(1)}$ is the smallest order statistic.
(b) Obtain confidence intervals (with confidence coefficient $1-\alpha$) for $\theta$ based on the two pivotal quantities in (a).
(c) Discuss which confidence interval in (b) is better in terms of the length.

34. Prove Theorem 7.3(ii).

35. Show that the expected length of the interval in (7.13) is shorter than the expected length of the interval in (7.12).

36. Consider Example 7.14.
(a) Suppose that $\theta = \sigma^2$ and $\mu$ is known. Let $a_*$ and $b_*$ be constants

satisfying $a_*^2 g(a_*) = b_*^2 g(b_*) > 0$ and $\int_{a_*}^{b_*} g(x)dx = 1-\alpha$, where $g$ is the p.d.f. of the chi-square distribution $\chi_n^2$. Show that the interval $[b_*^{-1}T, a_*^{-1}T]$ has the shortest length within the class of intervals of the form $[b^{-1}T, a^{-1}T]$, $\int_a^b g(x)dx = 1-\alpha$, where $T = \sum_{i=1}^n (X_i - \mu)^2$.
(b) Show that the expected length of the interval in (a) is shorter than the expected length of the interval in (7.14).
(c) Find the shortest-length interval for $\theta = \sigma$ within the class of confidence intervals of the form $[b^{-1/2}\sqrt{n-1}\,S, a^{-1/2}\sqrt{n-1}\,S]$, where $0 < a < b < \infty$, $\int_a^b f(x)dx = 1-\alpha$, and $f$ is the p.d.f. of the chi-square distribution $\chi_{n-1}^2$.

37. Assume the conditions in Theorem 7.3(i). Assume further that $f$ is symmetric. Show that $a_*$ and $b_*$ in Theorem 7.3(i) must satisfy $a_* = -b_*$.

38. Let $f$ be a Lebesgue p.d.f. that is nonzero in $[x_-, x_+]$ and is 0 outside $[x_-, x_+]$, $-\infty \le x_- < x_+ \le \infty$.
(a) Suppose that $f$ is strictly decreasing. Show that, among all intervals $[a, b]$ satisfying $\int_a^b f(x)dx = 1-\alpha$, the shortest interval is obtained by choosing $a = x_-$ and $b$ so that $\int_{x_-}^b f(x)dx = 1-\alpha$.
(b) Obtain a result similar to that in (a) when $f$ is strictly increasing.
(c) Show that the interval $[X_{(n)}, \alpha^{-1/n} X_{(n)}]$ in Example 7.13 has the shortest length among all intervals $[b^{-1}X_{(n)}, a^{-1}X_{(n)}]$.

39. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(a, 1)$ with an unknown $a$. Find a confidence interval for $a$ having the shortest length within the class of confidence intervals $[X_{(1)} + c, X_{(1)} + d]$ with confidence coefficient $1-\alpha$.

40. Consider the HPD credible set $C(x)$ in (7.7) for a real-valued $\theta$. Suppose that $p_x(\theta)$ is a unimodal Lebesgue p.d.f. and is not monotone. Show that $C(x)$ is an interval having the shortest length within the class of intervals $[a, b]$ satisfying $\int_a^b p_x(\theta)d\theta = 1-\alpha$.

41. Let $X$ be a single observation from the gamma distribution $\Gamma(\alpha, \gamma)$ with a known $\alpha$ and an unknown $\gamma$. Find the shortest-length confidence interval within the class of confidence intervals $[b^{-1}X, a^{-1}X]$ with a given confidence coefficient.

42. Let $X_1, ..., X_n$ be i.i.d. with the Lebesgue p.d.f. $\theta x^{\theta-1} I_{(0,1)}(x)$, where $\theta > 0$ is unknown.
(a) Construct a confidence interval for $\theta$ with confidence coefficient $1-\alpha$, using a sufficient statistic.
(b) Discuss whether the confidence interval obtained in (a) has the

shortest length within a class of confidence intervals.
(c) Discuss whether the confidence interval obtained in (a) is UMAU.

43. Let $X$ be a single observation from the logistic distribution $LG(\mu, 1)$ with an unknown $\mu \in \mathcal{R}$. Find a $\Theta'$-UMA upper confidence bound for $\mu$ with confidence coefficient $1-\alpha$, where $\Theta' = (\mu, \infty)$.

44. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(0, \theta)$ with an unknown $\theta > 0$. Find a $\Theta'$-UMA lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$, where $\Theta' = (0, \theta)$.

45. Let $X$ be a single observation from $N(\theta - 1, 1)$ if $\theta < 0$, $N(0, 1)$ if $\theta = 0$, and $N(\theta + 1, 1)$ if $\theta > 0$.
(a) Show that the distribution of $X$ is in a family with monotone likelihood ratio.
(b) Construct a $\Theta'$-UMA lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$, where $\Theta' = (-\infty, \theta)$.

46. Show that the confidence set in Example 7.9 is unbiased.

47. In Example 7.13, show that the confidence interval $[X_{(n)}, \alpha^{-1/n} X_{(n)}]$ is UMA and has the shortest expected length among all confidence intervals for $\theta$ with confidence coefficient $1-\alpha$.

48. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(a, \theta)$ with unknown $a$ and $\theta$. Find a UMA confidence interval for $a$ with confidence coefficient $1-\alpha$.

49. Let $Y$ and $U$ be independent random variables having the binomial distribution $Bi(p, n)$ and the uniform distribution $U(0, 1)$, respectively.
(a) Show that $W = Y + U$ has the Lebesgue p.d.f. $f_p(w)$ given by (7.17).
(b) Show that the family $\{f_p : p \in (0, 1)\}$ has monotone likelihood ratio in $W$.

50. Extend the results in the previous exercise to the case where the distribution of $Y$ is the power series distribution defined in Exercise 13 of §2.6.

51. Let $X_1, ..., X_n$ be i.i.d. from the Poisson distribution $P(\theta)$ with an unknown $\theta > 0$. Find a randomized UMA upper confidence bound for $\theta$ with confidence coefficient $1-\alpha$.

52. Let $X$ be a nonnegative integer-valued random variable from a population $P \in \mathcal{P}$. Suppose that $\mathcal{P}$ is parametric and indexed by a real-valued $\theta$ and has monotone likelihood ratio in $X$. Let $U$ be a

random variable from the uniform distribution $U(0, 1)$ that is independent of $X$. Show that a UMA lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$ is the solution of the equation

$$U F_\theta(X) + (1 - U) F_\theta(X - 1) = 1 - \alpha$$

(assuming that a solution exists), where $F_\theta(x)$ is the c.d.f. of $X$.

53. Let $X$ be a single observation from the hypergeometric distribution $HG(r, n, \theta - n)$ (Table 1.1) with known $r$, $n$, and an unknown $\theta = n+1, n+2, ...$. Derive a randomized UMA upper confidence bound for $\theta$ with confidence coefficient $1-\alpha$.

54. Let $X_1, ..., X_n$ be i.i.d. from $N(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma^2$.
(a) Show that $\overline\theta = \bar X + t_{n-1,\alpha} S/\sqrt n$ is a UMAU upper confidence bound for $\mu$ with confidence coefficient $1-\alpha$, where $t_{n-1,\alpha}$ is the $(1-\alpha)$th quantile of the t-distribution $t_{n-1}$.
(b) Show that the confidence bound in (a) can be derived by inverting acceptance regions of LR tests.

55. Prove Theorem 7.7 and Proposition 7.3.

56. Let $X_1, ..., X_n$ be i.i.d. with p.d.f. $f(x - \theta)$, where $f$ is a known Lebesgue p.d.f. Show that the confidence interval $[\bar X - c_1, \bar X + c_2]$ has constant coverage probability, where $c_1$ and $c_2$ are constants.

57. Prove the claim in Example 7.18.

58. In Example 7.19, show that
(a) the testing problem is invariant under $G_{\mu_0}$, but not $G$;
(b) the nonrandomized test with acceptance region $A(\mu_0)$ is UMP among unbiased and invariant tests of size $\alpha$, under $G_{\mu_0}$;
(c) $G$ is the smallest group containing $\cup_{\mu_0 \in \mathcal{R}} G_{\mu_0}$.

59. In Example 7.17, show that intervals (7.13) and (7.14) are UMA among unbiased and invariant confidence intervals with confidence coefficient $1-\alpha$, under $G_1$ and $G$, respectively.

60. Let $X_i$, $i = 1, 2$, be independent with the exponential distributions $E(0, \theta_i)$, $i = 1, 2$, respectively.
(a) Show that $[\alpha Y/(2-\alpha), (2-\alpha)Y/\alpha]$ is a UMAU confidence interval for $\theta_2/\theta_1$ with confidence coefficient $1-\alpha$, where $Y = X_2/X_1$.
(b) Show that the confidence interval in (a) is also UMAI.

61. Let $X_1, ..., X_n$ be i.i.d. from a bivariate normal distribution with unknown mean and covariance matrix and let $R(X)$ be the sample correlation coefficient. Define $\underline\rho = C^{-1}(R(X))$, where $C(\rho)$ is determined by

$$P\big(R(X) \le C(\rho)\big) = 1 - \alpha$$


and $\rho$ is the unknown correlation coefficient. Show that $\underline\rho$ is a $\Theta'$-UMAI lower confidence bound for $\rho$ with confidence coefficient $1-\alpha$, where $\Theta' = (-1, \rho)$.

62. Let $X_{i1}, ..., X_{in_i}$, $i = 1, 2$, be two independent samples i.i.d. from $N(\mu_i, \sigma^2)$, $i = 1, 2$, respectively, where $\mu_i$'s are unknown. Find a UMAI confidence interval for $\mu_2 - \mu_1$ with confidence coefficient $1-\alpha$ when (a) $\sigma^2$ is known; (b) $\sigma^2$ is unknown.

63. Consider Exercise 1. Let $\theta = \mu_1 - \mu_2$.
(a) Show that $\Re(X, \theta) = (\bar X_1 - \bar X_2 - \theta)\big/\sqrt{n_1^{-1}S_1^2 + n_2^{-1}S_2^2}$ is asymptotically pivotal, assuming that $n_1/n_2 \to c \in (0, \infty)$. Construct a $1-\alpha$ asymptotically correct confidence interval for $\theta$ using $\Re(X, \theta)$.
(b) Show that $t(X, \theta)$ defined in Exercise 1(a) is asymptotically pivotal if either $n_1/n_2 \to 1$ or $\sigma_1 = \sigma_2$ holds.

64. In Example 7.23, show that $C_3(X) = [p_-, p_+]$ with the given $p_\pm$. Compare the lengths of the confidence intervals $C_2(X)$ and $C_3(X)$.

65. Show that the confidence intervals in Example 7.14 can be derived by inverting acceptance regions of LR tests.

66. Let $X_1, ..., X_n$ be i.i.d. from the exponential distribution $E(0, \theta)$ with an unknown $\theta > 0$.
(a) Show that $\Re(X, \theta) = \sqrt n(\bar X - \theta)/\theta$ is asymptotically pivotal. Construct a $1-\alpha$ asymptotically correct confidence interval for $\theta$, using $\Re(X, \theta)$.
(b) Show that $\Re_1(X, \theta) = \sqrt n(\bar X - \theta)/\bar X$ is asymptotically pivotal. Construct a $1-\alpha$ asymptotically correct confidence interval for $\theta$, using $\Re_1(X, \theta)$.
(c) Obtain $1-\alpha$ asymptotically correct confidence intervals for $\theta$ by inverting acceptance regions of LR tests, Wald's tests, and Rao's score tests.

67. Let $X_1, ..., X_n$ be i.i.d. from the Poisson distribution $P(\theta)$ with an unknown $\theta > 0$.
(a) Show that $\Re(X, \theta) = (\bar X - \theta)/\sqrt{\theta/n}$ is asymptotically pivotal. Construct a $1-\alpha$ asymptotically correct confidence interval for $\theta$, using $\Re(X, \theta)$.
(b) Show that $\Re_1(X, \theta) = (\bar X - \theta)/\sqrt{\bar X/n}$ is asymptotically pivotal. Construct a $1-\alpha$ asymptotically correct confidence interval for $\theta$, using $\Re_1(X, \theta)$.
(c) Obtain $1-\alpha$ asymptotically correct confidence intervals for $\theta$ by inverting acceptance regions of LR tests, Wald's tests, and Rao's score tests.


68. Suppose that $X_1, ..., X_n$ are i.i.d. from the negative binomial distribution $NB(p, r)$ with a known $r$ and an unknown $p$. Obtain $1-\alpha$ asymptotically correct confidence intervals for $p$ by inverting acceptance regions of LR tests, Wald's tests, and Rao's score tests.

69. Suppose that $X_1, ..., X_n$ are i.i.d. from the log-distribution $L(p)$ with an unknown $p$. Obtain $1-\alpha$ asymptotically correct confidence intervals for $p$ by inverting acceptance regions of LR tests, Wald's tests, and Rao's score tests.

70. In Example 7.24, obtain $1-\alpha$ asymptotically correct confidence sets for $\mu$ by inverting acceptance regions of LR tests, Wald's tests, and Rao's score tests. Are these sets always intervals?

71. Let $X_1, ..., X_n$ be i.i.d. from the gamma distribution $\Gamma(\theta, \gamma)$ with unknown $\theta$ and $\gamma$. Obtain $1-\alpha$ asymptotically correct confidence sets for $\theta$ by inverting acceptance regions of LR tests, Wald's tests, and Rao's score tests. Discuss whether these confidence sets are intervals or not.

72. Consider the problem in Example 3.21. Construct an asymptotically pivotal quantity and a $1-\alpha$ asymptotically correct confidence set for $\mu_y/\mu_x$.

73. Consider the problem in Example 3.23. Construct an asymptotically pivotal quantity and a $1-\alpha$ asymptotically correct confidence set for $R(t)$ with a fixed $t$.

74. Let $U_n$ be a U-statistic based on i.i.d. $X_1, ..., X_n$ and the kernel $h(x_1, ..., x_m)$, and let $\theta = E(U_n)$. Construct an asymptotically pivotal quantity based on $U_n$ and a $1-\alpha$ asymptotically correct confidence set for $\theta$.

75. Let $X_1, ..., X_n$ be i.i.d. from a c.d.f. $F$ on $\mathcal{R}$ that is continuous and symmetric about $\theta$. Let $\hat\theta = W/n - \frac12$ and $T(F) = \theta + \frac12$, where $W$ and $T$ are given by (6.83) and (5.53), respectively. Construct a confidence interval for $\theta$ that has limiting confidence coefficient $1-\alpha$.

76. Consider the problem in Example 5.15. Construct an asymptotically pivotal quantity and a $1-\alpha$ asymptotically correct confidence set for $\theta$. Compare this confidence set with those in Example 7.24.

77. Consider the linear model $X = Z\beta + \varepsilon$, where $\varepsilon$ has independent components with mean 0 and $Z$ is of full rank. Assume the conditions in Theorem 3.12.
(a) Suppose that $\mathrm{Var}(\varepsilon) = \sigma^2 D$, where $D$ is a known diagonal matrix and $\sigma^2$ is unknown. Find an asymptotically pivotal quantity and


7.6. Exercises 537 


78. 


79. 


80. 


81. 


82. 


83. 


84. 


construct a 1 — a@ asymptotically correct confidence set for (. 

(b) Suppose that Var(e) is an unknown diagonal matrix. Find an 
asymptotically pivotal quantity and construct a 1 — a asymptotically 
correct confidence set for (. 


In part (a) of the previous exercise, obtain a 1 — a asymptotically 
correct confidence set for 3/c. 


Consider a GEE estimator 6 of @ described in 85.4.1. Discuss how to 
construct an asymptotically pivotal quantity and a 1 — a asymptoti- 
cally correct confidence set for 6. (Hint: see §5.5.2.) 


Let X1,..., Xp be i.i.d. from the exponential distribution E(a, 0) with 
unknown a and @. Find a 1 — a asymptotically correct confidence set 
for (a,@) by inverting acceptance regions of LR tests. 


81. Let $X_{i1},...,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from
$N(\mu_i, \sigma_i^2)$, $i = 1,2$, respectively, where all parameters are unknown.

(a) Find $1-\alpha$ asymptotically correct confidence sets for $(\mu_1,\mu_2)$ by
inverting acceptance regions of LR tests, Wald’s tests, and Rao’s score
tests.

(b) Repeat (a) for the parameter $(\mu_1,\mu_2,\sigma_1^2,\sigma_2^2)$.

(c) Repeat (a) under the assumption that $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

(d) Repeat (c) for the parameter $(\mu_1,\mu_2,\sigma^2)$.


82. Let $X_{i1},...,X_{in_i}$, $i = 1,2$, be two independent samples i.i.d. from the
exponential distributions $E(0,\theta_i)$, $i = 1,2$, respectively, where $\theta_i$’s
are unknown. Find $1-\alpha$ asymptotically correct confidence sets for
$(\theta_1,\theta_2)$ by inverting acceptance regions of LR tests, Wald’s tests, and
Rao’s score tests.


83. Consider the problem in Example 7.9. Find $1-\alpha$ asymptotically
correct confidence sets for $\theta$ by inverting acceptance regions of LR
tests, Wald’s tests, and Rao’s score tests. Which one is the same as
that derived in Example 7.9?


84. Let $X_1,...,X_n$ be i.i.d. from a continuous c.d.f. $F$ on $\mathcal{R}$ and let
$\theta = F^{-1}(p)$, $p \in (0,1)$.

(a) Show that $P(X_{(k_1)} \le \theta \le X_{(k_2)}) = P(U_{(k_1)} \le p \le U_{(k_2)})$, where
$X_{(k)}$ is the $k$th order statistic and $U_{(k)}$ is the $k$th order statistic based
on a sample $U_1,...,U_n$ i.i.d. from the uniform distribution $U(0,1)$.

(b) Show that

$$P(U_{(k_1)} \le p \le U_{(k_2)}) = B_p(k_1, n-k_1+1) - B_p(k_2, n-k_2+1),$$

where

$$B_p(i,j) = \frac{\Gamma(i+j)}{\Gamma(i)\Gamma(j)} \int_0^p t^{i-1}(1-t)^{j-1}\,dt.$$

(c) Discuss how to obtain a confidence interval for $\theta$ with confidence
coefficient $1-\alpha$.
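
The coverage probability in this exercise can be explored numerically. The sketch below is only an illustration, not part of the exercise: it uses the fact that the number of $U_i$'s not exceeding $p$ is binomial, so that $P(U_{(k_1)} \le p \le U_{(k_2)})$ is a binomial probability. The function names `binom_cdf`, `coverage`, and `quantile_interval`, and the rule of picking the pair with the smallest index span, are assumptions made for the illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), taken to be 0 when k < 0."""
    if k < 0:
        return 0.0
    upper = min(k, n)
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(upper + 1))


def coverage(k1, k2, n, p):
    """P(U_(k1) <= p <= U_(k2)) for 1 <= k1 < k2 <= n.

    With N = #{i : U_i <= p} ~ Bin(n, p), the event is {k1 <= N <= k2 - 1}.
    """
    return binom_cdf(k2 - 1, n, p) - binom_cdf(k1 - 1, n, p)


def quantile_interval(n, p, alpha):
    """Search all index pairs for the smallest span k2 - k1 whose coverage
    is at least 1 - alpha. Returns (k1, k2, coverage) or None if no pair works.
    The selection rule is a hypothetical choice for this sketch.
    """
    best = None
    for k1 in range(1, n):
        for k2 in range(k1 + 1, n + 1):
            c = coverage(k1, k2, n, p)
            if c >= 1 - alpha and (best is None or k2 - k1 < best[1] - best[0]):
                best = (k1, k2, c)
    return best


# Example: indices for an interval for the median with n = 20, alpha = 0.05.
result = quantile_interval(20, 0.5, 0.05)
```

Because the coverage does not depend on $F$, the same indices apply to any continuous c.d.f.; the interval is then $[X_{(k_1)}, X_{(k_2)}]$ computed from the sorted data.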


85. Prove Corollary 7.1.


86. Assume the conditions in Corollary 7.1.

(a) Show that $\sqrt{n}(X_{(k_n)} - \theta)F'(\theta) \to_d N(c, p(1-p))$.

(b) Prove result (7.24) using the result in part (a).

(c) Construct a consistent estimator of the asymptotic variance of the
sample median (see Example 6.27), using Woodruff’s interval.


87. Prove (7.25) and (7.26).


88. In Example 7.25, prove that $V_n^{-1/2}(\hat\theta_n - \theta)$ can be written as
$\sqrt{n}\,h(\bar{Y})/\sigma_h$, and find the explicit form of the function $h$.


89. Let $X_1,...,X_n$ be i.i.d. from an unknown c.d.f. $F$ with $E|X_1|^8 < \infty$.
Suppose that condition (1.105) is satisfied. Derive a second-order
accurate lower confidence bound for $\sigma^2 = \mathrm{Var}(X_1)$.


90. Using the Edgeworth expansion given in Example 7.26, construct a
third-order accurate lower confidence bound for $\mu$.


91. Show that $\underline{\theta}_{HB}$ in (7.34) is equal to $2\hat\theta_n - K_B^{-1}(1-\alpha)$, where $K_B$ is
defined in (7.27).


92. (Parametric bootstrapping in location-scale families). Let $X_1,...,X_n$
be i.i.d. random variables with p.d.f. $\frac{1}{\sigma} f\left(\frac{x-\mu}{\sigma}\right)$, where $f$ is a known
Lebesgue p.d.f. and $\mu$ and $\sigma > 0$ are unknown. Let $X_1^*,...,X_n^*$ be
i.i.d. bootstrap data from the p.d.f. $\frac{1}{s} f\left(\frac{x-\bar{x}}{s}\right)$, where $\bar{x}$ and $s^2$ are the
observed sample mean and sample variance, respectively.

(a) Suppose that we construct the bootstrap-t lower confidence bound
(7.35) for $\mu$ using the parametric bootstrap data. Show that $\underline{\theta}_{BT}$ has
confidence coefficient $1-\alpha$.

(b) Suppose that we construct the hybrid bootstrap lower confidence
bound (7.34) for $\mu$ using the parametric bootstrap data. Show that
$\underline{\theta}_{HB}$ does not necessarily have confidence coefficient $1-\alpha$.

(c) Suppose that $f$ has mean 0 and variance 1. Show that $\underline{\theta}_{HB}$ in (b)
is $1-\alpha$ asymptotically correct.


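93. (The bootstrap BC$_a$ percentile method). Suppose that we change
assumption (7.30) to

$$P\left( \frac{\phi_n(\hat\theta_n) - \phi_n(\theta)}{1 + a\phi_n(\theta)} + z_0 \le x \right) = \Phi(x),$$

where $a$ is an extra parameter called the acceleration constant and $\Phi$
is the c.d.f. of the standard normal distribution.

(a) If $\phi_n$, $z_0$, and $a$ are known, show that the following lower confidence
bound for $\theta$ has confidence coefficient $1-\alpha$:

$$\underline{\theta}_E = \phi_n^{-1}\left( \phi_n(\hat\theta_n) + (z_\alpha + z_0)(1 + a\phi_n(\hat\theta_n))/[1 - a(z_\alpha + z_0)] \right).$$

(b) Show that $K_B^{-1}(x) = \phi_n^{-1}\left( \phi_n(\hat\theta_n) + [\Phi^{-1}(x) - z_0](1 + a\phi_n(\hat\theta_n)) \right)$, where
$K_B$ is defined in (7.27).

(c) Let $\underline{\theta}_{BC}(a) = K_B^{-1}\left( \Phi(z_0 + (z_\alpha + z_0)/[1 - a(z_\alpha + z_0)]) \right)$. Show
that $\underline{\theta}_{BC}(a) = \underline{\theta}_E$ in part (a). (The bootstrap BC$_a$ percentile lower
confidence bound for $\theta$ is $\underline{\theta}_{BC}(\hat{a})$, where $\hat{a}$ is an estimator of $a$.)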
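
The identity in part (c) can be checked numerically before it is proved. The following is a minimal sketch under stated assumptions: $z_\alpha$ is taken to be $\Phi^{-1}(\alpha)$, the transformation is chosen as the logarithm purely for illustration, and the function names `theta_E`, `K_B_inv`, and `theta_BC` are hypothetical. The formula for `K_B_inv` is the one stated in part (b), not a bootstrap computation.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def theta_E(phi, phi_inv, theta_hat, z0, a, alpha):
    """Lower bound of part (a), with z_alpha = Phi^{-1}(alpha) (assumed convention)."""
    s = STD_NORMAL.inv_cdf(alpha) + z0
    ph = phi(theta_hat)
    return phi_inv(ph + s * (1 + a * ph) / (1 - a * s))


def K_B_inv(phi, phi_inv, theta_hat, z0, a, x):
    """Quantile function of the bootstrap distribution as given in part (b)."""
    ph = phi(theta_hat)
    return phi_inv(ph + (STD_NORMAL.inv_cdf(x) - z0) * (1 + a * ph))


def theta_BC(phi, phi_inv, theta_hat, z0, a, alpha):
    """BC_a percentile lower bound of part (c)."""
    s = STD_NORMAL.inv_cdf(alpha) + z0
    level = STD_NORMAL.cdf(z0 + s / (1 - a * s))
    return K_B_inv(phi, phi_inv, theta_hat, z0, a, level)


# Illustrative values only: phi = log, theta_hat = 2, z0 = 0.1, a = 0.05.
lower_a = theta_E(log, exp, 2.0, 0.1, 0.05, 0.05)
lower_c = theta_BC(log, exp, 2.0, 0.1, 0.05, 0.05)
```

The two values agree up to floating-point error, which is what part (c) asserts; with $a = 0$ the adjusted level reduces to $\Phi(z_\alpha + 2z_0)$.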


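94. (Automatic bootstrap percentile). Let $\mathcal{P} = \{P_\theta : \theta \in \mathcal{R}\}$ be a
parametric family. Define $K_\theta(x) = P_\theta(\hat\theta_n \le x)$, where $\hat\theta_n$ is an estimator
of $\theta$. Let $\theta_0$ be a given value of $\theta$ and $\theta_1 = K_{\theta_0}^{-1}(1-\alpha)$. The automatic
bootstrap percentile lower confidence bound for $\theta$ is defined to be

$$\underline{\theta}_{ABP} = K_{\hat\theta_n}^{-1}(K_{\theta_1}(\theta_0)).$$

Suppose that the assumption in the previous exercise holds. Show that
$\underline{\theta}_{ABP}$ has confidence coefficient $1-\alpha$.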


95. (Bootstrapping residuals). Consider linear model (3.25): $X = Z\beta + \varepsilon$,
where $Z$ is of full rank and $\varepsilon$ is a vector of i.i.d. random variables
having mean 0 and variance $\sigma^2$. Let $r_i = X_i - Z_i^\tau\hat\beta$ be the $i$th residual,
where $\hat\beta$ is the LSE of $\beta$. Assume that the average of $r_i$’s is always 0.
Let $\varepsilon_1^*,...,\varepsilon_n^*$ be i.i.d. bootstrap data from the empirical c.d.f. putting
mass $n^{-1}$ on each $r_i$. Define $X_i^* = Z_i^\tau\hat\beta + \varepsilon_i^*$, $i = 1,...,n$.

(a) Find an expression for $\hat\beta^*$, the bootstrap analogue of $\hat\beta$. Calculate
$E(\hat\beta^*|X)$ and $\mathrm{Var}(\hat\beta^*|X)$.

(b) Using $l^\tau(\hat\beta - \beta)$ and the idea in §7.4.1, construct a hybrid
bootstrap lower confidence bound for $l^\tau\beta$, where $l \in \mathcal{R}^p$.

(c) Discuss when the lower confidence bound in (b) is $1-\alpha$
asymptotically correct.

(d) Describe how to construct a bootstrap-t lower confidence bound
for $l^\tau\beta$.

(e) Describe how to construct a hybrid bootstrap confidence set for
$\beta$, using the idea in §7.4.1.
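
The resampling scheme of this exercise can be sketched in code for the simplest case. The example below is an illustration under stated assumptions, not a solution: it takes $Z_i = (1, z_i)$ (so the residuals average to 0 automatically), takes $l = (0,1)$ so that $l^\tau\beta$ is the slope, approximates the bootstrap distribution by Monte Carlo, and forms the hybrid lower bound as the estimate minus the $(1-\alpha)$ bootstrap quantile of $\hat\beta_1^* - \hat\beta_1$. The names `lse` and `hybrid_lower_bound`, the number of replications, and the simulated data are all hypothetical.

```python
import random


def lse(z, x):
    """Least squares fit of x_i = b0 + b1 * z_i; returns (b0, b1)."""
    n = len(z)
    zbar = sum(z) / n
    xbar = sum(x) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    b1 = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x)) / szz
    return xbar - b1 * zbar, b1


def hybrid_lower_bound(z, x, alpha=0.05, n_boot=2000, seed=0):
    """Hybrid bootstrap lower confidence bound for the slope,
    based on resampling residuals with replacement."""
    rng = random.Random(seed)
    b0, b1 = lse(z, x)
    fitted = [b0 + b1 * zi for zi in z]
    resid = [xi - fi for xi, fi in zip(x, fitted)]
    diffs = []
    for _ in range(n_boot):
        # Bootstrap responses: fitted value plus a resampled residual.
        x_star = [fi + rng.choice(resid) for fi in fitted]
        _, b1_star = lse(z, x_star)
        diffs.append(b1_star - b1)
    diffs.sort()
    # Empirical (1 - alpha) quantile of b1* - b1 (simple order-statistic rule).
    k = min(n_boot - 1, int((1 - alpha) * n_boot))
    return b1 - diffs[k]


# Simulated data, for illustration only.
_noise = random.Random(1)
z_demo = [float(i) for i in range(20)]
x_demo = [1.0 + 2.0 * zi + _noise.gauss(0.0, 1.0) for zi in z_demo]
slope_lower = hybrid_lower_bound(z_demo, x_demo, alpha=0.05, n_boot=500)
```

Resampling pairs or using external (wild) bootstrap weights changes only the line that builds `x_star`; the quantile step is the same.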


96. (Bootstrapping pairs). Consider linear model (3.25): $X = Z\beta + \varepsilon$,
where $Z$ is of full rank and $\varepsilon$ is a vector of independent random
variables having mean 0 and finite variances. Let $(X_1^*, Z_1^*),...,(X_n^*, Z_n^*)$
be i.i.d. bootstrap data from the empirical c.d.f. putting mass $n^{-1}$ on
each $(X_i, Z_i)$. Define $\hat\beta^* = (Z^{*\tau}Z^*)^{-1}\sum_{i=1}^n Z_i^* X_i^*$. Repeat (a)-(e) of
the previous exercise.




97. (External bootstrapping or wild bootstrapping). Assume the model
in the previous exercise. Let $\varepsilon_1^*,...,\varepsilon_n^*$ be i.i.d. random variables with
mean 0 and variance 1. Define the bootstrap data as $X_i^* = Z_i^\tau\hat\beta + |t_i|\varepsilon_i^*$,
$i = 1,...,n$, where $\hat\beta$ is the LSE of $\beta$, $t_i = (X_i - Z_i^\tau\hat\beta)/\sqrt{1 - h_i}$,
and $h_i = Z_i^\tau(Z^\tau Z)^{-1}Z_i$. Repeat (a)-(e) of Exercise 95.


98. Prove (7.48) and (7.49).


99. Describe how to approximate $C_{PREB}(X)$ in (7.51), using the Monte
Carlo method.


100. Prove (7.53) and (7.54).


101. Show that Bonferroni’s simultaneous confidence intervals are of level
$1-\alpha$.


102. Let $C_{t,\alpha}(X)$ be a confidence interval for $\theta_t$ with confidence coefficient
$1-\alpha$, $t = 1,...,k$. Suppose that $C_{1,\alpha}(X),...,C_{k,\alpha}(X)$ are independent
for any $\alpha$. Show how to construct simultaneous confidence intervals
for $\theta_t$, $t = 1,...,k$, with confidence coefficient $1-\alpha$.
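
The difference between the two preceding exercises can be seen in a few lines of arithmetic. The sketch below is one possible illustration, with hypothetical function names: under Bonferroni's method each interval is run at coefficient $1-\alpha/k$ and the joint coverage is only bounded below, whereas for independent intervals a per-interval coefficient of $(1-\alpha)^{1/k}$ gives joint coverage exactly $1-\alpha$ by multiplication.

```python
def bonferroni_level(alpha, k):
    """Per-interval confidence coefficient 1 - alpha/k (no independence needed)."""
    return 1 - alpha / k


def independent_level(alpha, k):
    """Per-interval confidence coefficient (1 - alpha)**(1/k) for independent intervals."""
    return (1 - alpha) ** (1 / k)


def joint_lower_bound_bonferroni(level, k):
    """Bonferroni lower bound on the joint coverage of k intervals."""
    return 1 - k * (1 - level)


def joint_coverage_independent(level, k):
    """Exact joint coverage of k independent intervals with a common coefficient."""
    return level**k


# Example: k = 5 intervals, alpha = 0.05.
per_interval_bonf = bonferroni_level(0.05, 5)
per_interval_indep = independent_level(0.05, 5)
```

The independent-case coefficient is never larger than the Bonferroni one, so when independence holds the product construction gives intervals that are no wider.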


103. Show that $C_{ij,\alpha}(X)$ in (7.60) is UMAU for $\mu_i - \mu_j$.


104. Consider the two-way balanced ANOVA model in Example 6.19. Using
Bonferroni’s method, obtain level $1-\alpha$ simultaneous confidence
intervals for

(a) $\alpha_i$, $i = 1,...,a-1$;

(b) $\mu_{ij}$, $i = 1,...,a$, $j = 1,...,b$.


105. Prove (7.62). (Hint: use the Cauchy-Schwarz inequality.)


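106. Let $x \in \mathcal{R}^k$, $y \in \mathcal{R}^k$, and $A$ be a $k \times k$ positive definite matrix.

(a) Suppose that $y^\tau A^{-1}x = 0$. Show that

$$x^\tau A^{-1}x = \max_{c \in \mathcal{R}^k,\, c \neq 0,\, c^\tau y = 0} \frac{(c^\tau x)^2}{c^\tau A c}.$$

(b) Assume model (7.61) with a full rank $Z$. Using the result in (a),
construct simultaneous confidence intervals (with confidence coefficient
$1-\alpha$) for $c^\tau\beta$, $c \in \mathcal{R}^p$, $c \neq 0$, $c^\tau y = 0$, where $y \in \mathcal{R}^p$ satisfies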
Z’ Ly =0. 


107. Assume the conditions in Theorem 3.12. Show that Scheffé’s intervals
in Theorem 7.10 are $1-\alpha$ asymptotically correct.


108. Assume the conditions in Theorem 3.12 and Theorem 7.10. Derive
$1-\alpha$ asymptotically correct simultaneous confidence intervals for
$t^\tau L\beta/\sigma$.



109. Prove (7.65).


110. Find explicitly the $m(m-1)/2$ vectors in the set $\mathcal{T}_0$ in (7.66) so
that $\{t^\tau L\beta : t \in \mathcal{T}_0\}$ is exactly the same as $\mu_i - \mu_j$, $1 \le i < j \le m$.
Show that the intervals in (7.66) are Scheffé’s simultaneous confidence
intervals.


111. In Example 7.28, show that

(a) Scheffé’s intervals in Theorem 7.10 with $t = (1, z)$ and $L = I_2$ are
of the form (7.67);

(b) the maximum on the right-hand side of (7.68) is achieved at $t$
given by (7.69);

(c) $y$ in (7.69) is equal to 1 and (7.68) holds.


112. Consider the two-way balanced ANOVA model in Example 6.19. Using
Scheffé’s method, obtain level $1-\alpha$ simultaneous confidence
intervals for $\alpha_i$’s, $\beta_j$’s, and $\gamma_{ij}$’s.


113. Let $X_{ij} = N(\mu + \alpha_i + \beta_j, \sigma^2)$, $i = 1,...,a$, $j = 1,...,b$, be independent,
where $\sum_{i=1}^a \alpha_i = 0$ and $\sum_{j=1}^b \beta_j = 0$. Construct level $1-\alpha$ simultaneous
confidence intervals for all linear combinations of $\alpha_i$’s and $\beta_j$’s,
using

(a) Bonferroni’s method;

(b) Scheffé’s method.


114. Assume model (7.61) with $\beta = (\beta_0, \beta_1, \beta_2)$ and $Z_i = (1, t_i, t_i^2)$, where
$t_i \in \mathcal{R}$, $\sum_{i=1}^n t_i = 0$, $\sum_{i=1}^n t_i^2 = 1$, and $\sum_{i=1}^n t_i^3 = 0$.

(a) Construct a confidence ellipsoid for $(\beta_1, \beta_2)$ with confidence
coefficient $1-\alpha$;

(b) Construct simultaneous confidence intervals for all linear
combinations of $\beta_1$ and $\beta_2$, with confidence coefficient $1-\alpha$.


115. Show that the distribution of $R_{st}$ in (7.70) does not depend on any
unknown parameter.


116. For $\alpha = 0.05$, obtain numerically the t-type confidence intervals in
(7.60), Bonferroni’s, Scheffé’s, and Tukey’s simultaneous confidence
intervals for $\mu_i - \mu_j$, $1 \le i < j \le 4$, based on the following data $X_{ij}$
from a one-way ANOVA model ($q_{0.05} = 4.45$):


2 3 4 5 6 
0.10 0.09 0.07 0.09 0.06 
0.09 0.11 0.10 0.08 0.13 
0.10 0.15 0.09 0.09 0.17 
0.11 0.07 0.09 0.11 0.08 



117. (Dunnett’s simultaneous confidence intervals). Let $X_{0j}$ ($j = 1,...,n_0$)
and $X_{ij}$ ($i = 1,...,m$, $j = 1,...,n_0$) represent independent measurements
on a standard and $m$ competing new treatments. Suppose
that $X_{ij} = N(\mu_i, \sigma^2)$ with unknown $\mu_i$ and $\sigma^2 > 0$, $i = 0,1,...,m$.
Let $\bar{X}_{i\cdot}$ be the sample mean based on $X_{ij}$, $j = 1,...,n_0$, and

$$\hat\sigma^2 = [(m+1)(n_0-1)]^{-1} \sum_{i=0}^m \sum_{j=1}^{n_0} (X_{ij} - \bar{X}_{i\cdot})^2.$$

(a) Show that the distribution of

$$R_{st} = \max_{i=1,...,m} |(\bar{X}_{i\cdot} - \mu_i) - (\bar{X}_{0\cdot} - \mu_0)|/\hat\sigma$$

does not depend on any unknown parameter.

(b) Show that

$$\left[ \sum_{i=0}^m c_i\bar{X}_{i\cdot} - q_\alpha\hat\sigma \sum_{i=1}^m |c_i|,\ \sum_{i=0}^m c_i\bar{X}_{i\cdot} + q_\alpha\hat\sigma \sum_{i=1}^m |c_i| \right]$$

for all $c_0, c_1, ..., c_m$ satisfying $\sum_{i=0}^m c_i = 0$ are simultaneous confidence
intervals for $\sum_{i=0}^m c_i\mu_i$ with confidence coefficient $1-\alpha$, where $q_\alpha$ is
the $(1-\alpha)$th quantile of $R_{st}$.


118. Let $X_1,...,X_n$ be i.i.d. from the uniform distribution $U(0,\theta)$, where
$\theta > 0$ is unknown. Construct a confidence band for the c.d.f. of $X_1$
with confidence coefficient $1-\alpha$.


119. Let $X_1,...,X_n$ be i.i.d. with the p.d.f. $\frac{1}{\sigma} f\left(\frac{x-\mu}{\sigma}\right)$, where $f$ is a known
Lebesgue p.d.f. (a location-scale family). Let $F$ be the c.d.f. of $X_1$.

(a) Suppose that $\mu \in \mathcal{R}$ is unknown and $\sigma$ is known. Construct
simultaneous confidence intervals for $F(t)$, $t \in \mathcal{R}$, with confidence
coefficient $1-\alpha$.

(b) Suppose that $\mu$ is known and $\sigma > 0$ is unknown. Construct
simultaneous confidence intervals for $F(t)$, $t \in \mathcal{R}$, with confidence
coefficient $1-\alpha$.

(c) Suppose that $\mu \in \mathcal{R}$ and $\sigma > 0$ are unknown. Construct level
$1-\alpha$ simultaneous confidence intervals for $F(t)$, $t \in \mathcal{R}$.


120. Let $X_1,...,X_n$ be i.i.d. from $F$ on $\mathcal{R}$ and $F_n$ be the empirical c.d.f.
Show that the intervals

$$[\, \max\{F_n(t) - c_\alpha, 0\},\ \min\{F_n(t) + c_\alpha, 1\} \,], \quad t \in \mathcal{R},$$

form a confidence band for $F(t)$, $t \in \mathcal{R}$, with limiting confidence
coefficient $1-\alpha$, where $c_\alpha$ is given by (7.75).
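
The band in this exercise is easy to compute once a half-width is available. The sketch below is an illustration with hypothetical names (`ecdf`, `band`, `dkw_half_width`). It takes the half-width `c` as an input, since the value $c_\alpha$ of (7.75) is not reproduced here; the helper `dkw_half_width` uses the Dvoretzky-Kiefer-Wolfowitz inequality with constant 2 as a stand-in, which is an assumption of this example and not the book's $c_\alpha$.

```python
from bisect import bisect_right
from math import log, sqrt


def ecdf(sample):
    """Return the empirical c.d.f. F_n as a function of t."""
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect_right(xs, t) / n


def band(sample, c):
    """Return t -> (lower, upper), the truncated band F_n(t) -/+ c."""
    F_n = ecdf(sample)

    def interval(t):
        v = F_n(t)
        return max(v - c, 0.0), min(v + c, 1.0)

    return interval


def dkw_half_width(n, alpha):
    """Half-width solving 2 * exp(-2 * n * c**2) = alpha (DKW bound; a stand-in
    for c_alpha in (7.75), which may differ)."""
    return sqrt(log(2 / alpha) / (2 * n))


# Example with a small illustrative sample.
demo_band = band([0.2, 0.5, 0.9, 0.4], 0.3)
lower_mid, upper_mid = demo_band(0.45)
```

Truncating at 0 and 1 does not change the coverage, because $F(t)$ always lies in $[0,1]$.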


References 


We provide some references for further readings on the topics covered in 
this book. 


For general probability theory, Billingsley (1986) and Chung (1974) are 
suggested, although there are many standard textbooks. An asymptotic 
theory for statistics can be found in Serfling (1980), Shorack and Wellner 
(1986), Sen and Singer (1993), Barndorff-Nielsen and Cox (1994), and van 
der Vaart (1998). 


More discussions of fundamentals of statistical decision theory and in- 
ference can be found in many textbooks on mathematical statistics, such as 
Cramér (1946), Wald (1950), Savage (1954), Ferguson (1967), Rao (1973), 
Rohatgi (1976), Bickel and Doksum (1977), Lehmann (1986), Casella and 
Berger (1990), and Barndorff-Nielsen and Cox (1994). Discussions and 
proofs for results related to sufficiency and completeness can be found in 
Rao (1945), Blackwell (1947), Hodges and Lehmann (1950), Lehmann and 
Scheffé (1950), and Basu (1955). More results for exponential families are 
given in Barndorff-Nielsen (1978). 


The theory of UMVUE in §3.1.1 and §3.1.2 is mainly based on Chapter 
2 of Lehmann (1983). More results on information inequalities can be 
found in Cramér (1946), Rao (1973), Lehmann (1983), and Pitman (1979). 
The theory of U-statistics and the method of projection can be found in 
Hoeffding (1948), Randles and Wolfe (1979), and Serfling (1980). The 
related theory for V-statistics is given in von Mises (1947), Serfling (1980), 
and Sen (1981). Three excellent textbooks for the theory of LSE are Scheffé 
(1959), Searle (1971), and Rao (1973). Additional materials for sample 
surveys can be found in Basu (1958), Godambe (1958), Cochran (1977), 
Särndal, Swensson, and Wretman (1992), and Ghosh and Meeden (1997). 

Excellent textbooks for the Bayesian theory include Lindley (1965), Box 
and Tiao (1973), Berger (1985), and Schervish (1995). For Bayesian com- 
putation and Markov chain Monte Carlo, more discussions can be found in 
references cited in §4.1.4. More general results on invariance in estimation 
and testing problems are provided by Ferguson (1967) and Lehmann (1983, 
1986). The theory of shrinkage estimation was established by Stein (1956) 
and James and Stein (1961); Lehmann (1983) and Berger (1985) provide 
excellent discussions on this topic. The method of likelihood has more than 
200 years of history (Edwards, 1974). An excellent textbook on the MLE 
in generalized linear models is McCullagh and Nelder (1989). Asymptotic 
properties for MLE can be found in Cramér (1946), Serfling (1980), and Sen 
and Singer (1993). Asymptotic results for the MLE in generalized linear 
models are provided by Fahrmeir and Kaufmann (1985). 


An excellent book containing results for empirical c.d.f.’s and their prop- 
erties is Shorack and Wellner (1986). References for empirical likelihoods 
are provided in §5.1.2 and §6.5.3. More results in density estimation can 
be found, for example, in Rosenblatt (1971) and Silverman (1986). Dis- 
cussions of partial likelihoods and proportional hazards models are given in 
Cox (1972) and Fleming and Harrington (1991). More discussions on statis- 
tical functionals can be found in von Mises (1947), Serfling (1980), Fernholz 
(1983), Sen and Singer (1993), and Shao and Tu (1995). Two textbooks 
for robust statistics are Huber (1981) and Hampel et al. (1986). A general 
discussion of L-estimators and sample quantiles can be found in Serfling 
(1980) and Sen (1981). L-estimators in linear models are covered by Bickel 
(1973), Puri and Sen (1985), Welsh (1987), and He and Shao (1996). Some 
references on generalized estimation equations and quasi-likelihoods are Go- 
dambe and Heyde (1987), Godambe and Thompson (1989), McCullagh and 
Nelder (1989), and Diggle, Liang, and Zeger (1994). Two textbooks con- 
taining materials on variance estimation are Efron and Tibshirani (1993) 
and Shao and Tu (1995). 


The theory of UMP, UMPU, and UMPI tests in Chapter 6 is mainly 
based on Lehmann (1986) and Chapter 5 of Ferguson (1967). Berger (1985) 
contains a discussion on Bayesian tests. Results on large sample tests and 
chi-square tests can be found in Serfling (1980) and Sen and Singer (1993). 
Two textbooks on nonparametric tests are Lehmann (1975) and Randles 
and Wolfe (1979). 


Further materials on confidence sets can be found in Ferguson (1967), 
Bickel and Doksum (1977), Lehmann (1986), and Casella and Berger (1990). 
More results on asymptotic confidence sets based on likelihoods can be 
found in Serfling (1980). The results on high order accurate confidence 
sets (§7.4.3) are based on Hall (1992). The theory of bootstrap confidence 
sets is covered by Hall (1992), Efron and Tibshirani (1993), and Shao and 
Tu (1995). Further discussions on simultaneous confidence intervals can be 
found in Scheffé (1959), Lehmann (1986), and Tukey (1977). 


The following references are those cited in this book. Many additional 
references can be found in Lehmann (1983, 1986). 


Arvesen, J. N. (1969). Jackknifing U-statistics. Ann. Math. Statist., 40, 
2076-2100. 


Bahadur, R. R. (1957). On unbiased estimates of uniformly minimum 
variance. Sankhya, 18, 211-224. 


Bahadur, R. R. (1964). On Fisher’s bound for asymptotic variances. Ann. 
Math. Statist., 35, 1545-1552. 


Bahadur, R. R. (1966). A note on quantiles in large samples. Ann. Math. 
Statist., 37, 577-580. 


Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in 
Statistical Theory. Wiley, New York. 


Barndorff-Nielsen, O. E. and Cox, D. R. (1994). Inference and Asymp- 
totics. Chapman & Hall, London. 


Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian 
computation and stochastic systems. Statist. Sci., 10, 3-66. 


Basu, D. (1955). On statistics independent of a complete sufficient statis- 
tic. Sankhya, 15, 377-380. 


Basu, D. (1958). On sampling with and without replacement. Sankhya, 
20, 287-294. 


Beran, R. (1987). Prepivoting to reduce level error of confidence sets. 
Biometrika, 74, 151-173. 


Berger, J. O. (1976). Inadmissibility results for generalized Bayes estima- 
tors of coordinates of a location vector. Ann. Statist., 4, 302-333. 


Berger, J. O. (1980). Improving on inadmissible estimators in continuous 
exponential families with applications to simultaneous estimation of 
gamma scale parameters. Ann. Statist., 8, 545-571. 


Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 
second edition. Springer-Verlag, New York. 


Bickel, P. J. (1973). On some analogues to linear combinations of order 
statistics in the linear model. Ann. Statist., 1, 597-616. 


Bickel, P. J. and Doksum, K. A. (1977). Mathematical Statistics. Holden 
Day, San Francisco. 


Bickel, P. J. and Yahav, J. A. (1969). Some contributions to the asymp- 
totic theory of Bayes solutions. Z. Wahrsch. Verw. Geb., 11, 257-276. 


Billingsley, P. (1986). Probability and Measure, second edition. Wiley, 
New York. 


Blackwell, D. (1947). Conditional expectation and unbiased sequential 
estimation. Ann. Math. Statist., 18, 105-110. 


Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical 
Analysis. Addison-Wesley, Reading, MA. 


Brown, L. D. (1966). On the admissibility of invariant estimators of one 
or more location parameters. Ann. Math. Statist., 37, 1087-1136. 


Brown, L. D. and Fox, M. (1974). Admissibility in statistical problems 
involving a location or scale parameter. Ann. Statist., 2, 248-266. 


Carroll, R. J. (1982). Adapting for heteroscedasticity in linear models. 
Ann. Statist., 10, 1224-1233. 


Carroll, R. J. and Cline, D. B. H. (1988). An asymptotic theory for 
weighted least-squares with weights estimated by replication. Biomet- 
rika, 75, 35-48. 


Casella, G. and Berger, R. L. (1990). Statistical Inference. Wadsworth, 
Belmont, CA. 


Chan, K. S. (1993). Asymptotic behavior of the Gibbs sampler. J. Amer. 
Statist. Assoc., 88, 320-325. 


Chen, J. and Qin, J. (1993). Empirical likelihood estimation for finite pop- 
ulations and the effective usage of auxiliary information. Biometrika, 
80, 107-116. 


Chen, J. and Shao, J. (1993). Iterative weighted least squares estimators. 
Ann. Statist., 21, 1071-1092. 


Chung, K. L. (1974). A Course in Probability Theory, second edition. 
Academic Press, New York. 


Clarke, B. R. (1986). Nonsmooth analysis and Fréchet differentiability of 
M-functionals. Prob. Theory and Related Fields, 73, 197-209. 


Cochran, W. G. (1977). Sampling Techniques, third edition. Wiley, New 
York. 


Cox, D. R. (1972). Regression models and life tables. J. R. Statist. Soc., 
B, 34, 187-220. 


Cramér, H. (1946). Mathematical Methods of Statistics. Princeton Uni- 
versity Press, Princeton, NJ. 


Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994). Analysis of Longitu- 
dinal Data. Clarendon Press, Oxford. 


Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, second 
edition. Wiley, New York. 


Durbin, J. (1973). Distribution Theory for Tests Based on the Sample 
Distribution Function. SIAM, Philadelphia, PA. 


Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic mini- 
max character of the sample distribution function and of the classical 
multinomial estimator. Ann. Math. Statist., 27, 642-669. 


Edwards, A. W. F. (1974). The history of likelihood. Internat. Statist. 
Rev., 42, 4-15. 


Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. 
Statist., 7, 1-26. 


Efron, B. (1981). Nonparametric standard errors and confidence intervals 
(with discussions). Canadian J. Statist., 9, 139-172. 


Efron, B. (1987). Better bootstrap confidence intervals (with discussions). 
J. Amer. Statist. Assoc., 82, 171-200. 


Efron, B. and Morris, C. (1973). Stein’s estimation rule and its competi- 
tors — An empirical Bayes approach. J. Amer. Statist. Assoc., 68, 
117-130. 


Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. 
Chapman & Hall, New York. 


Esseen, C. and von Bahr, B. (1965). Inequalities for the rth absolute 
moment of a sum of random variables, 1 < r < 2. Ann. Math. 
Statist., 36, 299-303. 


Fahrmeir, L. and Kaufmann, H. (1985). Consistency and asymptotic nor- 
mality of the maximum likelihood estimator in generalized linear mod- 
els. Ann. Statist., 13, 342-368. 


Farrell, R. H. (1964). Estimators of a location parameter in the absolutely 
continuous case. Ann. Math. Statist., 35, 949-998. 


Farrell, R. H. (1968a). Towards a theory of generalized Bayes tests. Ann. 
Math. Statist., 38, 1-22. 


Farrell, R. H. (1968b). On a necessary and sufficient condition for admis- 
sibility of estimators when strictly convex loss is used. Ann. Math. 
Statist., 38, 23-28. 


Ferguson, T. S. (1967). Mathematical Statistics. Academic Press, New 
York. 


Fernholz, L. T. (1983). Von Mises Calculus for Statistical Functionals. 
Lecture Notes in Statistics, 19, Springer-Verlag, New York. 


Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and 
Survival Analysis. Wiley, New York. 


Fuller, W. A. (1996). Introduction to Statistical Time Series, second edi- 
tion. Wiley, New York. 


Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to 
calculating marginal densities. J. Amer. Statist. Assoc., 85, 398-409. 


Geweke, J. (1989). Bayesian inference in econometric models using Monte 
Carlo integration. Econometrica, 57, 1317-1339. 


Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likeli- 
hood calculations. J. R. Statist. Soc., B, 56, 261-274. 


Ghosh, M. and Meeden, G. (1997). Bayesian Methods in Finite Population 
Sampling. Chapman & Hall, London. 


Godambe, V. P. (1958). A unified theory of sampling from finite popula- 
tions. J. R. Statist. Soc., B, 17, 269-278. 


Godambe, V. P. and Heyde, C. C. (1987). Quasi-likelihood and optimal 
estimation. Internat. Statist. Rev., 55, 231-244. 


Godambe, V. P. and Thompson, M. E. (1989). An extension of quasi- 
likelihood estimation (with discussion). J. Statist. Plan. Inference, 
22, 137-172. 


Hall, P. (1988). Theoretical comparisons of bootstrap confidence intervals 
(with discussions). Ann. Statist., 16, 927-953. 


Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer- 
Verlag, New York. 


Hall, P. and Martin, M. A. (1988). On bootstrap resampling and iteration. 
Biometrika, 75, 661-671. 


Hampel, F. R. (1974). The influence curve and its role in robust estima- 
tion. J. Amer. Statist. Assoc., 62, 1179-1186. 


Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. 
(1986). Robust Statistics: The Approach Based on Influence Func- 
tions. Wiley, New York. 


He, X. and Shao, Q.-M. (1996). A general Bahadur representation of M- 
estimators and its application to linear regression with nonstochastic 
designs. Ann. Statist., 24, 2608-2630. 


Hodges, J. L., Jr. and Lehmann, E. L. (1950). Some problems in minimax 
point estimation. Ann. Math. Statist., 21, 182-197. 


Hoeffding, W. (1948). A class of statistics with asymptotic normal distri- 
bution. Ann. Math. Statist., 19, 293-325. 


Hogg, R. V. and Tanis, E. A. (1993). Probability and Statistical Inference, 
fourth edition. Macmillan, New York. 


Huber, P. J. (1964). Robust estimation of a location parameter. Ann. 
Math. Statist., 35, 73-101. 


Huber, P. J. (1981). Robust Statistics. Wiley, New York. 


Ibragimov, I. A. and Has’minskii, R. Z. (1981). Statistical Estimation: 
Asymptotic Theory. Springer-Verlag, New York. 


James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. 
Fourth Berkeley Symp. Math. Statist. Prob., 1, 311-319. University 
of California Press, CA. 


Jeffreys, H. (1939, 1948, 1961). The Theory of Probability. Oxford Uni- 
versity Press, Oxford. 


Jones, M. C. (1991). Kernel density estimation for length biased data. 
Biometrika, 78, 511-519. 


Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of 
Failure Time Data. Wiley, New York. 


Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from in- 
complete observations. J. Amer. Statist. Assoc., 53, 457-481. 


Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likeli- 
hood estimator in the presence of infinitely many nuisance parame- 
ters. Ann. Math. Statist., 27, 887-906. 


Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di 
distribuzione. Giorn. Inst. Ital. Attuari, 4, 83-91. 


Le Cam, L. (1953). On some asymptotic properties of maximum likeli- 
hood estimates and related Bayes’ estimates. Univ. of Calif. Publ. in 
Statist., 1, 277-330. 


Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on 
Ranks. Holden Day, San Francisco. 


Lehmann, E. L. (1983). Theory of Point Estimation. Springer-Verlag, 
New York. 


Lehmann, E. L. (1986). Testing Statistical Hypotheses, second edition. 
Springer-Verlag, New York. 


Lehmann, E. L. and Scheffé, H. (1950). Completeness, similar regions and 
unbiased estimation. Sankhya, 10, 305-340. 


Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using 
generalized linear models. Biometrika, 73, 13-22. 


Lindley, D. V. (1965). Introduction to Probability and Statistics from a 
Bayesian Point of View. Cambridge University Press, London. 


Liu, R. Y. and Singh, K. (1987). On a partial correction by the bootstrap. 
Ann. Statist., 15, 1713-1718. 


Loève, M. (1977). Probability Theory I, fourth edition. Springer-Verlag, 
New York. 


Loh, W.-Y. (1987). Calibrating confidence coefficients. J. Amer. Statist. 
Assoc., 82, 155-162. 


Loh, W.-Y. (1991). Bootstrap calibration for confidence interval construc- 
tion and selection. Statist. Sinica, 1, 479-495. 


McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, sec- 
ond edition. Chapman & Hall, London. 


Mendenhall, W. and Sincich, T. (1995). Statistics for Engineering and the 
Sciences, fourth edition. Prentice-Hall, Englewood Cliffs, NJ. 


Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and 
Teller, E. (1953). Equations of state calculations by fast computing 
machines. J. Chemical Physics, 21, 1087-1091. 


Milliken, G. A. and Johnson, D. E. (1992). Analysis of Messy Data, Vol. 
1: Designed Experiments. Chapman & Hall, London. 


Moore, D. S. and Spruill, M. C. (1975). Unified large-sample theory of 
general chi-squared statistics for test of fit. Ann. Statist., 3, 599-616. 


Müller, H.-G. and Stadtmüller, U. (1987). Estimation of heteroscedastic- 
ity in regression analysis. Ann. Statist., 15, 610-625. 


Natanson, I. P. (1961). Theory of Functions of a Real Variable, Vol. 1, 
revised edition. Ungar, New York. 


Nummelin, E. (1984). General Irreducible Markov Chains and Non- 
Negative Operators. Cambridge University Press, New York. 


Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a 
single functional. Biometrika, 75, 237-249. 


Owen, A. B. (1990). Empirical likelihood confidence regions. Ann. 
Statist., 18, 90-120. 


Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, Boca 
Raton. 


Parthasarathy, K. R. (1967). Probability Measures on Metric Spaces. Aca- 
demic Press, New York. 


Petrov, V. V. (1975). Sums of Independent Random Variables. Springer- 
Verlag, Berlin-Heidelberg. 


Pitman, E. J. G. (1979). Some Basic Theory for Statistical Inference. 
Chapman & Hall, London. 


Puri, M. L. and Sen, P. K. (1985). Nonparametric Methods in General 
Linear Models. Wiley, New York. 


Qin, J. (1993). Empirical likelihood in biased sample problems. Ann. 
Statist., 21, 1182-1196. 


Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating 
equations. Ann. Statist., 22, 300-325. 


Qin, J., Leung, D., and Shao, J. (2002). Estimation with survey data 
under nonignorable nonresponse or informative sampling. J. Amer. 
Statist. Assoc., 97, 193-200. 


Quenouille, M. (1949). Approximation tests of correlation in time series. 
J. R. Statist. Soc., B, 11, 18-84. 


Randles, R. H. and Wolfe, D. A. (1979). Introduction to the Theory of 
Nonparametric Statistics. Wiley, New York. 


Rao, C. R. (1945). Information and the accuracy attainable in the esti- 
mation of statistical parameters. Bull. Calc. Math. Soc., 37, 81-91. 


Rao, C. R. (1947). Large sample tests of statistical hypotheses concerning 
several parameters with applications to problems of estimation. Proc. 
Camb. Phil. Soc., 44, 50-57. 


Rao, C. R. (1973). Linear Statistical Inference and Its Applications, sec- 
ond edition. Wiley, New York. 


Rohatgi, V. K. (1976). An Introduction to Probability Theory and Math- 
ematical Statistics. Wiley, New York. 


Rosenblatt, M. (1971). Curve estimates. Ann. Math. Statist., 42, 1815- 
1842. 


Royden, H. L. (1968). Real Analysis, second edition. Macmillan, New 
York. 


Särndal, C. E., Swensson, B., and Wretman, J. (1992). Model Assisted 
Survey Sampling. Springer-Verlag, New York. 


Savage, L. J. (1954). The Foundations of Statistics. Wiley, New York. 


Scheffé, H. (1959). Analysis of Variance. Wiley, New York. 


Schervish, M. J. (1995). Theory of Statistics. Springer-Verlag, New York. 


Searle, S. R. (1971). Linear Models. Wiley, New York. 


Sen, P. K. (1981). Sequential Nonparametrics: Invariance Principles and 
Statistical Inference. Wiley, New York. 


Sen, P. K. and Singer, J. M. (1993). Large Sample Methods in Statistics. 
Chapman & Hall, London. 


Serfling, R. J. (1980). Approximation Theorems of Mathematical Statis- 
tics. Wiley, New York. 


Shao, J. (1989). Monte Carlo approximations in Bayesian decision theory. 
J. Amer. Statist. Assoc., 84, 727-732. 


Shao, J. (1993). Differentiability of statistical functionals and consistency 
of the jackknife. Ann. Statist., 21, 61-75. 


Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, 
New York. 


Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Ap- 
plications to Statistics. Wiley, New York. 


Silverman, B. W. (1986). Density Estimation for Statistics and Data Anal- 
ysis. Chapman & Hall, London. 


Smirnov, N. V. (1944). An approximation to the distribution laws of 
random quantiles determined by empirical data. Uspehi Mat. Nauk, 
10, 179-206. 


Smyth, G. K. (1989). Generalized linear models with varying dispersion. 
J. R. Statist. Soc., B, 51, 47-60. 


Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a 
multivariate distribution. Proc. Third Berkeley Symp. Math. Statist. 
Prob., 1, 197-206. University of California Press, Berkeley, CA. 


Stein, C. (1959). The admissibility of Pitman’s estimator for a single 
location parameter. Ann. Math. Statist., 30, 970-979. 


Stone, C. J. (1974). Asymptotic properties of estimators of a location 
parameter. Ann. Statist., 2, 1127-1137. 


Stone, C. J. (1977). Consistent nonparametric regression (with discus- 
sion). Ann. Statist., 5, 595-645. 


Strawderman, W. E. (1971). Proper Bayes minimax estimators of the 
multivariate normal mean. Ann. Math. Statist., 42, 385-388. 


Tanner, M. A. (1996). Tools for Statistical Inference, third edition. 
Springer-Verlag, New York. 


Tate, R. F. and Klett, G. W. (1959). Optimal confidence intervals for 
the variance of a normal distribution. J. Amer. Statist. Assoc., 54, 
674-682. 


Tierney, L. (1994). Markov chains for exploring posterior distributions 
(with discussions). Ann. Statist., 22, 1701-1762. 


Tsiatis, A. A. (1981). A large sample study of Cox’s regression model. 
Ann. Statist., 9, 93-108. 


Tsui, K.-W. (1981). Simultaneous estimation of several Poisson parame- 
ters under squared error loss. Ann. Inst. Statist. Math., 10, 299-326. 


Tukey, J. (1958). Bias and confidence in not quite large samples. Ann. 
Math. Statist., 29, 614. 


Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, 
MA. 


van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University 
Press, Cambridge. 


Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. 
Statist., 13, 178-203. 


von Mises, R. (1947). On the asymptotic distribution of differentiable 
statistical functionals. Ann. Math. Statist., 18, 309-348. 




Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadel- 
phia, PA. 


Wald, A. (1943). Tests of statistical hypotheses concerning several param- 
eters when the number of observations is large. Trans. Amer. Math. 
Soc., 54, 426-482. 


Wald, A. (1950). Statistical Decision Functions. Wiley, New York. 


Weerahandi, S. (1995). Exact Statistical Methods for Data Analysis. 
Springer-Verlag, New York. 


Welsh, A. H. (1987). The trimmed mean in the linear model. Ann. Statist., 
15, 20-36. 


Woodruff, R. S. (1952). Confidence intervals for medians and other posi- 
tion measures. J. Amer. Statist. Assoc., 47, 635-646. 


Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods 
in regression analysis (with discussions). Ann. Statist., 14, 1261-1350. 


List of Notation 


$\mathcal{R}$: the real line. 

$\mathcal{R}^k$: the k-dimensional Euclidean space. 

$c = (c_1, \ldots, c_k)$: a vector (element) in $\mathcal{R}^k$, which is considered as a $k \times 1$ matrix (column vector) when matrix algebra is involved. 

$c^\tau$: the transpose of a vector $c$, which is considered as a $1 \times k$ matrix (row vector) when matrix algebra is involved. 

$\|c\|$: the Euclidean norm of a vector $c \in \mathcal{R}^k$, $\|c\|^2 = c^\tau c$. 

$\mathcal{B}$: the Borel σ-field on $\mathcal{R}$. 

$\mathcal{B}^k$: the Borel σ-field on $\mathcal{R}^k$. 

$(a, b)$ and $[a, b]$: the open and closed intervals from $a$ to $b$. 

$\{a, b\}$: the set consisting of the elements $a$ and $b$. 

$I_k$: the $k \times k$ identity matrix. 

$A^\tau$: the transpose of a matrix $A$. 

Det($A$): the determinant of a matrix $A$. 




tr($A$): the trace of a matrix $A$. 

$\|A\|$: the norm of a matrix $A$ defined as $\|A\|^2 = \mathrm{tr}(A^\tau A)$. 

$A^{-1}$: the inverse of a matrix $A$. 

$A^{-}$: the generalized inverse of a matrix $A$. 

$A^{1/2}$: the square root of a nonnegative definite matrix $A$ defined by $A^{1/2} A^{1/2} = A$. 

$A^{-1/2}$: the inverse of $A^{1/2}$. 

$A^c$: the complement of the set $A$. 

$P(A)$: the probability of the set $A$. 

$I_A$: the indicator function of the set $A$. 

$\delta_x$: the point mass at $x$ or the c.d.f. degenerated at $x$. 

$\{a_n\}$: a sequence of vectors or random vectors $a_1, a_2, \ldots$. 

$a_n \to a$: $\{a_n\}$ converges to $a$ as $n$ increases to $\infty$. 

$\to_{a.s.}$: convergence almost surely. 

$\to_p$: convergence in probability. 




$\to_d$: convergence in distribution. 

$g'$, $g''$, and $g^{(k)}$: the first-, second-, and kth-order derivatives of a function $g$ on $\mathcal{R}$. 

$g(x+)$ or $g(x-)$: the right or left limit of the function $g$ at $x$. 

$\partial g/\partial x$ or $\nabla g$: the partial derivative of the function $g$ on $\mathcal{R}^k$. 

$\partial^2 g/\partial x \partial x^\tau$ or $\nabla^2 g$: the second-order partial derivative of the function $g$ on $\mathcal{R}^k$. 

$F^{-1}(p)$: the pth quantile of a c.d.f. $F$, $F^{-1}(p) = \inf\{x : F(x) \ge p\}$. 

$E(X)$ or $EX$: the expectation of a random variable (vector or matrix) $X$. 

Var($X$): the variance (covariance matrix) of a random variable (vector) $X$. 

Cov($X$, $Y$): the covariance between random variables $X$ and $Y$. 

$\mathcal{P}$: a family containing the population $P$ that generates data. 

$b_T(P)$: the bias of an estimator $T$ under population $P$. 

$\tilde{b}_T(P)$: an asymptotic bias of an estimator $T$ under population $P$. 

$\mathrm{mse}_T(P)$: the mse of an estimator $T$ under population $P$. 

$R_T(P)$: the risk of an estimator $T$ under population $P$. 

$\mathrm{amse}_T(P)$: an asymptotic mse of an estimator $T$ under population $P$. 


$e_{T'_n, T_n}(P)$: the asymptotic relative efficiency of $T'_n$ w.r.t. $T_n$. 

$\alpha_T(P)$: probability of type I error for a test $T$. 

$\beta_T(P)$: power function for a test $T$. 

$X_{(i)}$: the ith order statistic of $X_1, \ldots, X_n$. 

$\bar{X}$: the sample mean of $X_1, \ldots, X_n$, $\bar{X} = n^{-1} \sum_{i=1}^n X_i$. 

$S^2$: the sample variance (covariance matrix) of $X_1, \ldots, X_n$, $S^2 = (n-1)^{-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\tau$. 

$F_n$: the empirical c.d.f. based on $X_1, \ldots, X_n$. 


$N(\mu, \sigma^2)$: the one-dimensional normal distribution or random variable with mean $\mu$ and variance $\sigma^2$. 

$N_k(\mu, \Sigma)$: the k-dimensional normal distribution or random vector with mean vector $\mu$ and covariance matrix $\Sigma$. 

$\Phi(x)$: the standard normal c.d.f. 

$z_\alpha$: the αth quantile of the standard normal distribution. 

$\chi^2_r$: a random variable having the chi-square distribution $\chi^2_r$. 

$\chi^2_{r,\alpha}$: the $(1-\alpha)$th quantile of the chi-square distribution $\chi^2_r$. 

$t_{r,\alpha}$: the $(1-\alpha)$th quantile of the t-distribution $t_r$. 

$F_{a,b,\alpha}$: the $(1-\alpha)$th quantile of the F-distribution $F_{a,b}$. 


List of Abbreviations 


a.e.: almost everywhere. 


amse: asymptotic mean squared 
error. 


ANOVA: analysis of variance. 
a.s.: almost surely. 

BC: bias-corrected. 

BC_a: accelerated bias-corrected. 


BLUE: best linear unbiased estima- 
tor. 


c.d.f.: cumulative distribution 
function. 


ch.f.: characteristic function. 
CLT: central limit theorem. 


GEE: generalized estimation equa- 
tion. 


GLM: generalized linear model. 
HPD: highest posterior density. 


i.i.d.: independent and identically 
distributed. 


LR: likelihood ratio. 


LSE: least squares estimator. 




MCMC: Markov chain Monte Carlo. 


MELE: maximum empirical likeli- 
hood estimator. 


m.g.f.: moment generating func- 
tion. 


MLE: maximum likelihood esti- 
mator. 


MQLE: maximum quasi-likelihood 
estimator. 


MRIE: minimum risk invariant es- 
timator. 


mse: mean squared error. 


p.d.f.: probability density func- 
tion. 


RLE: root of likelihood equation. 
SLLN: strong law of large numbers. 
UMA: uniformly most accurate. 


UMAI: uniformly most accurate 
invariant. 


UMAU: uniformly most accurate 
unbiased. 


UMP: uniformly most powerful. 




UMPI: uniformly most powerful invariant. 

UMPU: uniformly most powerful unbiased. 

UMVUE: uniformly minimum variance unbiased estimator. 


WLLN: weak law of large numbers. 


w.r.t.: with respect to. 


Index of Definitions, Main 
Results, and Examples 


Corollaries 
1.1 (delta-method) 61 
1.2 (multivariate CLT) 69 
1.3 (CLT for independent random 
vectors) 69-70 
3.1 (UMVUE) 168 


3.2 (variance of a U-statistic) 177 
3.3 (conditions for UMVUE’s in 


normal linear models) 191 
4.1 (Pitman estimator) 257 
4.2 (MRIE for location) 259 


4.3 (admissibility and minimaxity 
in exponential families) 266 


5.1 (asymptotic normality of sam- 


ple quantiles) 355 
6.1 (UMP tests in exponential fam- 
ilies) 400 
7.1 (order statistics and sample 
quantiles) 501 
7.2 (Woodruff’s interval) 502 
Definitions 
1.1 (σ-field) 2 
1.2 (measure) 3 
1.3 (measurable function) 6 
1.4 (integration) 10-11 
1.5 (m.g.f. and ch.f.) 33 




1.6 (conditional expectation) 37 
1.7 (independence) 41 
1.8 (convergence mode) 50 
1.9 (stochastic order) 55 
2.1 (parametric family) 94 
2.2 (exponential family) 96 
2.3 (location-scale family) 99 
2.4 (sufficiency) 103 
2.5 (minimal sufficiency) 107 
2.6 (completeness) 110 
2.7 (admissibility) 116 
2.8 (unbiasedness) 119 
2.9 (invariance) 119 
2.10 (consistency) 132-133 
2.11 (asymptotic expectation and 
bias) 135 
2.12 (amse and relative efficiency) 
138 

2.13 (asymptotic level and consis- 
tency) 140 
2.14 (asymptotic level and confi- 
dence coefficient) 142 

3.1 (UMVUE) 161 
3.2 (U-statistic) 174 
3.3 (projection) 178 
3.4 (LSE) 182 
4.1 (Bayes action) 233 
4.2 (MRIE) 252 




4.3 (likelihood and MLE) 274 
4.4 (asymptotic efficiency) 289 
5.1 (metric and norm) 320 
5.2 (differentiability of functionals) 
338-339 

5.3 (second-order differentiability 
of functionals) 342 

5.4 (consistency of variance estima- 
tors) 372 

6.1 (UMP tests) 394 
6.2 (monotone likelihood ratio) 
397-398 

6.3 (unbiasedness of tests and 
UMPU tests) 404 

6.4 (similarity) 404 
6.5 (invariance of tests and UMPI 
tests) 417-418 


6.6 (likelihood ratio tests) 428 
7.1 (pivotal quantity) 471 
7.2 (UMA confidence sets) 488-489 
7.3 (unbiasedness of confidence sets 


and UMAU confidence sets) 
489-490 


7.4 (invariance of confidence sets 
and UMAI confidence sets) 493 


7.5 (accuracy of confidence sets) 


503 

7.6 (simultaneous confidence inter- 
vals) 519 

Examples 

1.1 (counting measure) 3 
1.2 (Lebesgue measure) 3 
1.3 (discrete c.d.f.) 9 
1.4 (continuous c.d.f.) 9 
1.5 (integration w.r.t. counting 
measure) 11 


1.6 (integration w.r.t. Lebesgue 
measure) 11-12 


1.7 (interchange of limit and inte- 


gration) 12 

1.8 (interchange of differentiation 
and integration) 13 

1.9 (iterative integration w.r.t. 
counting measure) 14 
1.10 (p.d.f. of a discrete c.d.f.) 15 
1.11 (Lebesgue p.d.f.) 16 
1.12 (c.d.f. w.r.t. a general mea- 
sure) 19 
1.13 (pairwise independence) 22 
1.14 (transformations) 23 
1.15 (p.d.f. of sum and ratio of two 
random variables) 24 
1.16 (t-distribution and F-distri- 
bution) 25 
1.17 (moments of normal distribu- 
tions) 29 
1.18 (Jensen’s inequality) 31-32 
1.19 (m.g.f. of normal distributions) 
34-35 

1.20 (gamma distributions and 
their m.g.f.’s) 36 
1.21 (conditional expectation given 
a discrete random variable) 38 
1.22 (the best predictor) 40-41 
1.23 (market survey) 44-45 
1.24 (first-order autoregressive pro- 
cess) 45-46 
1.25 (likelihood ratio) 48 
1.26 (various types of convergence) 
52-53 

1.27 (a.s. convergence of the best 
predictor) 55 
1.28 (WLLN and CLT) 58 
1.29 (application of the continuity 
theorem) 58 
1.30 (application of the continuous 
mapping theorem) 59 




1.31 (application of the delta- 
method) 62 
1.32 (application of the SLLN) 


66-67 
1.33 (application of the CLT) 69 
1.34 (Edgeworth expansions) 72-73 
2.1 (measurement problems) 92 
2.2 (life-time testing problems) 93 


2.3 (survey problems) 93 
2.4 (k-dimensional normal family) 

94-95 
2.5 (binomial family) 96-97 


2.6 (univariate normal family) 97 


2.7 (multinomial family) 98 
2.8 (statistics and their distribu- 
tions) 101-102 

2.9 (order statistics) 102 
2.10 (sufficiency for binary data) 
103-104 

2.11 (truncation family) 106 
2.12 (order statistics) 106 
2.13 (minimal sufficiency in a uni- 
form family) 107 
2.14 (minimal sufficiency in expo- 
nential families) 109 
2.15 (completeness in a normal 
family) 110-111 
2.16 (completeness in a uniform 
family) 111 


2.17 (completeness in a nonpara- 


metric family) 111-112 
2.18 (application of Basu’s theo- 
rem) 112-113 
2.19 (estimation in measurement 
problems) 114 
2.20 (testing hypotheses) 114-115 


2.21 (clean-up toxic waste) 115 
2.22 (estimation of means) 118 


2.23 (testing in a binomial family) 
118-119 


2.24 (invariant decision rules) 120 
2.25 (Bayes and minimax rules) 


120-121 

2.26 (estimation in life-time testing 
problems) 123-124 
2.27 (estimation in survey prob- 
lems) 124 
2.28 (testing in a normal family) 
126-127 

2.29 (p-value) 127-128 


2.30 (randomized tests in a bino- 
mial family) 128 


2.31 (confidence intervals in a nor- 
mal family) 129 


2.32 (confidence sets in a normal 
family) 130-131 


2.33 (consistency of linear statistics 
and the sample variance) 133 


2.34 (consistency of the largest or- 


der statistic) 134 
2.35 (asymptotic bias in a binomial 
family) 137 


2.36 (amse in a Poisson family) 139 


2.37 (asymptotic properties of tests 
in a normal family) 140-141 
3.1 (UMVUE’s in a uniform family) 
162 
3.2 (UMVUE’s in a Poisson family) 
162-163 
3.3 (UMVUE’s in an exponential 
distribution family) 163-164 
3.4 (UMVUE’s in a normal family) 164-165 

3.5 (UMVUE’s in a power series 
family) 165-166 

3.6 (UMVUE’s in a nonparametric 
family) 166 




3.7 (UMVUE’s in a uniform family) 
167-168 


3.8 (nonexistence of UMVUE) 168 


3.9 (Fisher information in a loca- 
tion-scale family) 170 


3.10 (variances of UMVUE’s) 172 
3.11 (variances of U-statistics) 177 
3.12 (simple linear regression) 185 
3.13 (one-way ANOVA) 185 


3.14 (two-way balanced ANOVA) 
186 


3.15 (UMVUE’s in normal linear 
models) 188 


3.16 (BLUE’s in linear models) 191 
3.17 (random effects models) 192 
3.18 (BLUE’s in linear models) 


192-193 

3.19 (UMVUE’s under simple ran- 
dom sampling) 197 
3.20 (UMVUE’s under stratified 
simple random sampling) 
198-199 

3.21 (ratio estimators) 204-205 
3.22 (maximum of a polynomial) 
205 

3.23 (degradation curve) 205-206 


3.24 (moment estimators of mean 
and variance) 207-208 


3.25 (moment estimators in a uni- 


form family) 208-209 
3.26 (moment estimators in a bino- 
mial family) 209 
3.27 (moment estimators in a 
Pareto family) 209 
3.28 (U- and V-statistics for vari- 
ances) 212-213 
3.29 (linear models with a special 
error covariance matrix) 
215-216 


3.30 (linear models with a block 
diagonal error covariance ma- 
trix) 216 

3.31 (linear models with a diagonal 
error covariance matrix) 216 


4.1 (Bayes estimates) 234 
4.2 (Bayes actions) 235 
4.3 (generalized Bayes actions) 236 
4.4 (empirical Bayes actions) 237 


4.5 (hierarchical Bayes actions) 
238-239 


4.6 (admissibility of the sample 
mean in a normal family) 241 
4.7 (asymptotic properties of Bayes 
estimators) 242 
4.8 (asymptotic properties of Bayes 
estimators) 243 


4.9 (Bayes estimators in normal lin- 
ear models) 244-245 


4.10 (MCMC in normal linear mod- 


els) 248 
4.11 (MRIE for location in a normal 
family) 254 


4.12 (MRIE for location in an ex- 
ponential distribution family) 
254 


4.13 (Pitman estimators in a uni- 
form family) 254-255 


4.14 (Pitman estimators in a nor- 
mal family) 257 


4.15 (Pitman estimators in a uni- 
form family) 257 


4.16 (MRIE in a normal family) 259 
4.17 (MRIE in a uniform family) 
259-260 


4.18 (minimax estimators in a bi- 
nomial family) 262-263 


4.19 (minimax estimators in a nor- 
mal family) 264 




4.20 (admissibility and minimaxity 
in a normal family) 266 


4.21 (minimax estimators in a Pois- 
son family) 267 


4.22 (simultaneous estimation in 


normal linear models) 267 
4.23 (Bayes estimators in a multi- 
nomial family) 268 


4.24 (simultaneous MRIE) 268 


4.25 (simultaneous minimax esti- 


mators) 268 
4.26 (simultaneous Bayes and min- 
imax estimators) 268 
4.27 (shrinkage estimators in nor- 
mal linear models) 272 
4.28 (maximizing likelihoods) 
273-274 

4.29 (MLE’s in a binomial family) 
275 

4.30 (MLE’s in a normal family) 
276-277 

4.31 (MLE’s in uniform families) 
277 

4.32 (MLE’s in a hypergeometric 
family) 277 
4.33 (MLE’s in a gamma family) 
277-278 

4.34 (MCMC in computing MLE’s) 
278 

4.35 (binomial GLM) 280-281 
4.36 (normal linear model) 282-283 
4.37 (Poisson GLM) 283 
4.38 (Hodges’ example of supereffi- 
ciency) 287 
4.39 (asymptotic properties of 


RLE’s in exponential families) 

292 

4.40 (one-step MLE’s in a Weibull 
family) 296 


5.1 (MELE’s in survey problems) 


327-328 

5.2 (biased sampling) 328-329 
5.3 (censored data) 329-330 
5.4 (density estimates) 332 
5.5 (convolution functionals) 341 
5.6 (L-estimators) 344 
5.7 (M-estimators) 346-347 
5.8 (Hodges-Lehmann estimator) 
351 

5.9 (interquartile range) 355 


5.10 (sample mean and sample me- 
dian) 356-357 
5.11 (consistency of GEE’s) 366 
5.12 (consistency of GEE’s) 367 
5.13 (asymptotic normality of M- 
estimators) 369 
5.14 (asymptotic normality of 
GEE’s) 371 
5.15 (asymptotic variances of func- 
tions of sample means) 373 
5.16 (substitution variance estima- 
tors for L-estimators) 375 
6.1 (UMP tests for simple hypothe- 
ses) 395-396 
6.2 (UMP tests in a binomial fam- 
ily) 396 
6.3 (monotone likelihood ratio in 
exponential families) 398 


6.4 (monotone likelihood ratio in a 


uniform family) 398-399 

6.5 (families having monotone like- 
lihood ratio) 399 

6.6 (UMP tests in a normal family) 
400 

6.7 (UMP tests in a binomial fam- 
ily) 400 

6.8 (UMP tests in a Poisson family) 
400-401 




6.9 (UMP tests in a uniform family) 
401 

6.10 (UMP tests for two-sided hy- 
potheses in a normal family) 
403 
6.11 (UMPU tests for comparison 
of two Poisson or two binomial 
distributions) 408-409 
6.12 (UMPU tests in 2 x 2 contin- 
gency tables) 409-410 
6.13 (UMPI tests in location-scale 
families) 419 
6.14 (maximal invariant statistics 
in nonparametric problems) 


419-420 

6.15 (UMPI tests in a normal fam- 
ily) 420-421 
6.16 (UMPI tests in a two-sample 
problem) 421 
6.17 (UMPI tests in a normal fam- 
ily) 422 
6.18 (UMPI tests in one-way 
ANOVA) 424-425 


6.19 (UMPI tests in two-way bal- 
anced ANOVA) 425-426 


6.20 (LR tests in a uniform family) 


430 

6.21 (LR tests in normal linear 
models) 430-431 
6.22 (Wald’s and Rao’s score tests 
in GLM’s) 435 


6.23 (goodness of fit tests) 437 
6.24 (χ²-tests in r × c contingency tables) 439-440 
6.25 (Bayes tests in a normal family) 441 
6.26 (empirical LR tests) 451 
6.27 (asymptotic tests) 453-454 


7.1 (confidence intervals in location-scale families) 472-473 


7.2 (confidence intervals in a uni- 


form family) 473-474 
7.3 (Fieller’s interval) 474 
7.4 (confidence sets in normal linear 
models) 474 
7.5 (confidence intervals in a Pois- 
son family) 476 


7.6 (confidence sets based on ranks) 
476 

7.7 (confidence intervals in one- 
parameter exponential fami- 
lies) 478-479 

7.8 (confidence intervals in multipa- 
rameter exponential families) 


479 

7.9 (confidence sets in normal linear 
models) 480 
7.10 (confidence intervals based on 
signed rank tests) 480 
7.11 (credible sets in a normal fam- 
ily) 481-482 


7.12 (prediction sets in normal lin- 
ear models) 482-483 
7.13 (lengths of confidence intervals 
in a uniform family) 484-485 
7.14 (shortest length confidence in- 
tervals in a normal family) 
486-487 

7.15 (UMAU confidence intervals in 
normal linear models) 490 
7.16 (UMA confidence bounds in a 
binomial family) 492-493 
7.17 (invariant confidence intervals 
in a normal family) 493 
7.18 (UMAI confidence bounds in a 
normal family) 494 
7.19 (invariant confidence intervals 
in a normal family) 494 
7.20 (asymptotic confidence sets for 
functions of means) 495 




7.21 (asymptotic confidence sets 
based on functionals) 495-496 


7.22 (asymptotic confidence sets in 
linear models) 496 
7.23 (likelihood based confidence 
intervals in a binomial family) 
498-499 

7.24 (likelihood based confidence 
sets in a normal family) 
499-500 


7.25 (second-order accurate confi- 
dence bounds) 505 


7.26 (comparison of bootstrap and 
asymptotic confidence bounds) 


514-515 
7.27 (multiple comparison in one- 
way ANOVA) 520 


7.28 (simultaneous confidence in- 
tervals in simple linear regres- 
sion) 522-523 


7.29 (comparison of simultaneous 


confidence intervals) 524-525 
7.30 (confidence bands in a normal 
family) 525-526 
Figures 
2.1 (mse’s) 125 
2.2 (error probabilities) 126 
2.3 (a confidence set) 131 
4.1 (mse’s) 263 
5.1 (density estimates) 332 
7.1 (confidence intervals and accep- 
tance regions) 478 
7.2 (confidence sets based on likeli- 
hoods) 500 
Lemmas 


1.1 (independence of transforma- 
tions) 22 


1.2 (measurable functions w.r.t. a 
sub-σ-field) 37 
1.3 (independence of σ-fields) 41 
1.4 (conditions for a.s. convergence) 


51 
1.5 (Borel-Cantelli lemma) 53 
1.6 (Kronecker’s lemma) 62 


2.1 (dominating measure in para- 
metric families) 104 
3.1 (mean squared difference be- 
tween a statistic and its pro- 
jection) 178-179 
3.2 (moment of a sum of indepen- 
dent chi-square random vari- 


ables) 180 

3.3 (conditions for asymptotic nor- 
mality of LSE’s) 195 

4.1 (calculation of posterior means) 
243 

4.2 (irreducibility of Markov 
chains) 250 

4.3 (minimax estimators) 264 
5.1 (DKW’s inequality) 321 
5.2 (variation norm bound) 341-342 
5.3 (uniform WLLN) 364 
6.1 (UMP tests) 397 
6.2 (generalized Neyman-Pearson 
lemma) 397 

6.3 (monotonicity of expectations) 
398 

6.4 (change of sign) 402-403 


6.5 (similarity and UMPU) 405 
6.6 (Neyman structure and com- 


pleteness) 405 

6.7 (UMPU tests) 411 
Propositions 

1.1 (properties of measures) 4 

1.2 (properties of c.d.f.’s) 4 




1.3 (product measure theorem) 5 
1.4 (measurable functions) 8 
1.5 (linearity of integrals) 12 
1.6 (monotonicity of integrals) 12 
1.7 (calculus with Radon-Nikodym 
derivatives) 16-17 


1.8 (p.d.f. of transformations) 23 
1.9 (conditional expectations with a 
joint p.d.f.) 38-39 
1.10 (properties of conditional ex- 
pectations) 39-40 
1.11 (conditional independence) 
41-42 

1.12 (properties of Markov chains) 
46-47 

1.13 (properties of submartingales) 
49 


1.14 (Doob’s decomposition) 49 
1.15 (convergence of submartin- 


gales) 49 
1.16 (Pólya’s theorem) 51 
1.17 (tightness) 56 
1.18 (Scheffé’s theorem) 59 
2.1 (completeness in exponential 

families) 110 
2.2 (rules based on a sufficient 

statistic) 117 
2.3 (uniqueness of asymptotic ex- 

pectation) 136 
2.4 (relationship between mse and 

amse) 138-139 
3.1 (Fisher information) 170 
3.2 (Fisher information in an expo- 

nential family) 171 


3.3 (estimators with variances the 
same as the Cramér-Rao lower 
bound) 172 


3.4 (conditions for BLUE’s) 192 
3.5 (properties of V-statistics) 211 


4.1 (existence and uniqueness of 


Bayes actions) 233 

4.2 (unbiasedness of Bayes estima- 
tors) 241 

4.3 (location invariant estimators) 
251 

4.4 (properties of location invariant 
estimators) 252 


4.5 (scale invariant estimators) 256 
4.6 (location-scale invariant estima- 
tors) 258 
5.1 (asymptotic properties of se- 
cond-order differentiable func- 


tionals) 343 

5.2 (consistency of GEE’s in i.i.d. 
cases) 363 

5.3 (consistency of GEE’s in i.i.d. 
cases) 363 

5.4 (consistency of GEE’s in non-i.i.d. cases) 364 

5.5 (consistency of GEE’s in non-i.i.d. cases) 366 

5.6 (consistency of GEE’s in non-i.i.d. cases) 367 

6.1 (generalized Neyman-Pearson 
lemma) 397 

6.2 (properties of maximal invari- 
ant statistics) 418 

6.3 (invariance and sufficiency) 420 

6.4 (relationship between UMPI 
and UMPU tests) 421 

6.5 (LR tests in exponential fami- 
lies) 429 

7.1 (pivotal quantities) 474 
7.2 (construction of tests using a 
confidence set) 477 

7.3 (properties of invariant confi- 
dence sets) 494 




7.4 (asymptotic comparison of con- 
fidence sets in terms of volume) 
496 


Tables 


1.1 (discrete distributions on R) 18 
1.2 (distributions on R with Lebesgue p.d.f.’s) 20-21 
Theorems 

1.1 (Fatou’s lemma, dominated 
convergence theorem, and 
monotone convergence theo- 
rem) 13 

1.2 (change of variables) 13 
1.3 (Fubini’s theorem) 14 
1.4 (Radon-Nikodym theorem) 15 
1.5 (Cochran’s theorem) 27 
1.6 (uniqueness of distribution with 
a given ch.f.) 35 

1.7 (existence of conditional distri- 
butions) 43 

1.8 (convergence and uniform inte- 
grability) 51 

1.9 (weak convergence, Lévy- 


Cramér continuity theorem, 
and Cramér-Wold device) 56 


1.10 (continuous mapping) 59 
1.11 (Slutsky’s theorem) 60 
1.12 (delta-method) 60 
1.13 (WLLN and SLLN) 62 
1.14 (WLLN and SLLN) 65 
1.15 (Lindeberg’s CLT) 67 
1.16 (Edgeworth expansion) 72 


1.17 (Cornish-Fisher expansion) 73 
2.1 (properties of exponential fam- 
ilies) 98-99 

2.2 (factorization theorem) 104 
2.3 (minimal sufficiency) 108 


2.4 (Basu’s theorem) 112 
2.5 (Rao-Blackwell theorem) 117 
2.6 (amse) 139 
3.1 (Lehmann-Scheffé theorem) 162 
3.2 (conditions for UMVUE) 166 
3.3 (Cramér-Rao lower bound) 169 
3.4 (Hoeffding’s theorem) 176 
3.5 (asymptotic distribution of a U- 
statistic) 180 

3.6 (estimability in linear models) 
184 

3.7 (UMVUE’s in normal linear 
models) 186 

3.8 (distributions of UMVUE’s in 
normal linear models) 188 

3.9 (Gauss-Markov theorem) 189 


3.10 (conditions for BLUE’s in lin- 
ear models) 190 


3.11 (consistency of LSE’s) 193-194 


3.12 (asymptotic normality of 
LSE’s) 194 
3.13 (Watson and Royall theorem) 


196 


3.14 (UMVUE’s under stratified 
simple random sampling) 198 


3.15 (Horvitz-Thompson estimator) 


199-200 

3.16 (asymptotic distribution of a 
V-statistic) 212 
3.17 (asymptotic normality of 
weighted LSE’s) 213-214 

4.1 (Bayes formula) 232 
4.2 (admissibility of Bayes rules) 
240 

4.3 (admissibility of a limit of Bayes 
rules) 240 

4.4 (consistency of MCMC) 
246-247 

4.5 (MRIE for location) 253 




4.6 (Pitman estimator) 253 
4.7 (properties of invariant rules) 
256 

4.8 (MRIE for scale) 256 
4.9 (MRIE for location) 259 
4.10 (MRIE in normal linear mod- 
els) 261 
4.11 (minimaxity of Bayes estima- 
tors) 261-262 
4.12 (minimaxity of limits of Bayes 
estimators) 263 
4.13 (estimators with constant 
risks) 264 
4.14 (admissibility in exponential 
families) 265 
4.15 (risk of James-Stein estima- 
tors) 269-270 
4.16 (asymptotic information in- 
equality) 287-288 
4.17 (asymptotic efficiency of 
RLE’s in i.i.d. cases) 290 
4.18 (asymptotic efficiency of 


RLE’s in GLM’s) 292-293 
4.19 (asymptotic efficiency of one- 


step MLE’s) 295-296 
4.20 (asymptotic efficiency of Bayes 
estimators) 297 
5.1 (asymptotic properties of em- 
pirical c.d.f.’s) 321 
5.2 (asymptotic properties of em- 
pirical c.d.f.’s) 322 


5.3 (nonparametric MLE) 324 
5.4 (asymptotic normality of 
MELE’s) 326 
5.5 (asymptotic properties of differ- 
entiable functionals) 340 
5.6 (differentiability of L-func- 
tionals) 344 


5.7 (differentiability of M-func- 
tionals) 347 


5.8 (differentiability of functionals 


for R-estimators) 349 

5.9 (a probability bound for sample 
quantiles) 352 
5.10 (asymptotic normality of sam- 
ple quantiles) 353 
5.11 (Bahadur’s representation) 
354 


5.12 (asymptotic normality of L- 
estimators in linear models) 
359 

5.13 (asymptotic normality of 
GEE’s in i.i.d. cases) 367 
5.14 (asymptotic normality of 
GEE’s in non-i.i.d. cases) 369 
5.15 (consistency of substitution 
variance estimators) 374 


5.16 (substitution variance estima- 
tors for GEE’s) 375-376 
5.17 (consistency of jackknife vari- 
ance estimators for functions 

of sample means) 377 
5.18 (consistency of jackknife vari- 
ance estimators for LSE’s) 379 
5.19 (consistency of jackknife vari- 
ance estimators for GEE’s) 380 
5.20 (consistency of bootstrap esti- 
mators for functionals) 383 

6.1 (Neyman-Pearson lemma) 394 
6.2 (UMP tests in families having 
monotone likelihood ratio) 399 

6.3 (UMP tests for two-sided hypo- 
theses in exponential families) 
401-402 

6.4 (UMPU tests in exponential 
families) 406-407 

6.5 (asymptotic distribution of LR 
tests) 432 

6.6 (asymptotic distribution of 
Wald’s and Rao’s score tests) 
434 




6.7 (asymptotic distribution of like- 
lihood based tests in GLM’s) 


435 
6.8 (asymptotic distribution of χ²-tests) 436 


6.9 (asymptotic distribution of 
goodness of fit tests) 438 
6.10 (asymptotic distribution of 
Kolmogorov-Smirnov statistics) 447-448 
6.11 (asymptotic distribution of 
empirical LR tests) 450 
6.12 (consistency of asymptotic 
tests) 452 
7.1 (confidence intervals based on a 
c.d.f.) 475 


7.2 (construction of confidence 
sets by inverting acceptance 
regions) 477 
7.3 (shortest length confidence in- 
tervals) 485-486 


7.4 (UMA confidence set) 489 
7.5 (UMAU confidence set) 490 
7.6 (Pratt’s theorem) 491 
7.7 (UMAI confidence set) 493-494 
7.8 (Bahadur’s representation) 501 


7.9 (asymptotic correctness of 
bootstrap confidence sets) 
509-510 


7.10 (Scheffé’s simultaneous confi- 
dence intervals) 521 
7.11 (Tukey’s simultaneous confi- 
dence intervals) 523 


Author Index 


Arvesen, J. N., 378, 545 


Bahadur, R. R., 108, 110, 288, 501, 
545 


Barndorff-Nielsen, O. E., 49, 74, 
543, 545 


Besag, J., 250, 545 
Basu, D., 543, 545 
Beran, R., 515-516, 545 


Berger, J. O., 122, 236-238, 245- 
246, 273, 297, 441, 482, 543- 
545 


Berger, R. L., 543-544, 546 
Bickel, P. J., 297, 359, 543-545 


Billingsley, P., 1, 5-6, 14-15, 35, 43, 
49, 53, 56, 69-70, 543, 546 


Blackwell, D., 543, 546 

Box, G. E. P., 236, 543, 546 
Brown, L. D., 255, 273, 546 
Carroll, R. J., 216-217, 546 
Casella, G., 543-544, 546 

Chan, K. S., 247-248, 546 

Chen, J., 217, 328, 451, 501, 546 


Chung, K. L., 1, 16, 34, 64, 66, 543, 
546 




Clarke, B. R., 348, 546 
Cline, D. B. H., 217, 546 


Cochran, W. G., 195, 197, 201-202, 
543, 546 


Cox, D. R., 49, 74, 333, 543-546 
Cramér, H., 543-544, 546 
Diggle, P. J., 544, 547 

Doksum, K. A., 543-545 
Draper, N. R., 182, 547 
Durbin, J., 449, 547 

Dvoretzky, A., 321, 547 
Edwards, A. W. F., 544, 547 


Efron, B., 271, 383, 506-508, 514, 
544, 547 


Esseen, C., 30, 547 

Fahrmeir, L., 544, 547 

Farrell, R. H., 240, 255, 547 
Ferguson, T. S., 31, 543-544, 548 


Fernholz, L. T., 340-341, 354, 544, 
548 


Fleming, T. R., 335, 544, 548 
Fox, M., 255, 546 




Fuller, W. A., 285, 548 
Gelfand, A. E., 247, 548 
Geweke, J., 246, 548 

Geyer, C. J., 278, 548 

Ghosh, M., 543, 548 
Godambe, V. P., 543-544, 548 
Green, P., 545 


Hall, P., 71-74, 383, 504, 511-512, 
514-517, 544, 548 


Hampel, F. R., 339, 346, 544, 548 
Harrington, D. P., 335, 544, 548 
Has’minskii, R. Z., 297, 549 

He, X., 370, 544, 549 

Heyde, C. C., 544, 548 

Higdon, D., 545 

Hodges, J. L., Jr., 543, 549 
Hoeffding, W., 543, 549 

Hogg, R. V., 91, 549 

Huber, P. J., 340, 346, 360, 544, 549 
Ibragimov, I. A., 297, 549 

James, W., 269, 544, 549 
Jeffreys, H., 236, 549 

Johnson, D. E., 524, 550 

Jones, M. C., 332, 549 
Kalbfleisch, J. D., 329, 549 
Kaplan, E. L., 330, 549 
Kaufmann, H., 544, 547 




Kiefer, J., 321, 324, 547, 549 
Klett, G. W., 487, 553 
Kolmogorov, A. N., 1, 448, 549 


Lawless, J., 324, 336, 363, 451, 501, 
551 


Le Cam, L., 287, 549 


Lehmann, E. L., 31, 98, 110, 128, 
137, 171, 234, 261, 271, 273, 
297, 357, 397, 408, 420, 427, 
442-445, 456, 465, 543-544, 
549-550 


Leung, D., 337, 551 

Liang, K.-Y., 362, 544, 547, 550 
Lindley, D. V., 543, 550 

Liu, R. Y., 514, 550 

Loève, M., 1, 31, 66, 550 
Loh, W.-Y., 517-518, 550 
Martin, M. A., 517, 548 
McCullagh, P., 282, 544, 550 
Meeden, G., 543, 548 

Meier, P., 330, 549 
Mendenhall, W., 524-525, 550 
Mengersen, K., 545 
Metropolis, N., 249, 550 
Milliken, G. A., 524, 550 
Moore, D. S., 439, 550 
Morris, C., 271, 547 

Müller, H.-G., 216, 550 


Natanson, I. P., 342, 551 




Nelder, J. A., 282, 544, 550 
Nummelin, E., 247, 250, 551 
Owen, A. B., 324, 450-451, 501, 551 
Parthasarathy, K. P., 108, 551 
Petrov, V. V., 63, 551 

Pitman, E. J. G., 172, 548, 551 
Prentice, R. T., 329, 549 

Puri, M. L., 544, 551 


Qin, J., 324, 328-329, 336-337, 363, 
451, 501, 546, 551 


Quenouille, M., 376, 551 
Randles, R. H., 543-544, 551 
Rao, C. R., 189, 484, 543, 551-552 
Rohatgi, V. K., 543, 552 
Ronchetti, E. M., 548 
Rosenblatt, M., 544, 552 
Rosenbluth, A. W., 550 
Rosenbluth, M. N., 550 
Rousseeuw, P. J., 548 

Royden, H. L., 3, 111, 162, 552 
Särndal, C. E., 195, 543, 552 
Savage, L. J., 543, 552 


Scheffé, H., 27, 110, 521, 543-544, 
550, 552 


Schervish, M. J., 543, 552 
Searle, S. R., 27, 182, 543, 552 


Sen, P. K., 32, 49, 70, 543-544, 551- 
552 




Serfling, R. J., 49, 70, 179-180, 290, 
351, 355, 370, 435, 454, 543- 
544, 552 


Shao, J., 74, 175, 217, 246, 337, 
348, 378, 382-383, 544, 546, 
551-552 


Shao, Q.-M., 370, 544, 549 


Shorack, G. R., 49, 70, 320, 330, 
543-544, 552 


Silverman, B. W., 332, 544, 552 
Sincich, T., 524-525, 550 


Singer, J. M., 32, 49, 70, 543-544, 
552 


Singh, K., 514, 550 
Smirnov, N. V., 448, 552 
Smith, A. F. M., 247, 548 
Smith, H., 182, 547 
Smyth, G. K., 284, 553 
Spruill, M. C., 439, 550 
Stadtmüller, U., 216, 550 
Stahel, W. A., 548 


Stein, C., 255, 269, 271, 544, 549, 
553 


Stone, C. J., 295, 332, 553 
Strawderman, W. E., 271, 553 
Swensson, B., 195, 543, 552 
Tanis, E. A., 91, 549 

Tanner, M. A., 246, 250, 553 
Tate, R. F., 487, 553 




Teller, A. H., 550 

Teller, E., 550 

Thompson, M. E., 544, 548 

Tiao, G. C., 236, 543, 546 
Tibshirani, R. J., 383, 508, 544, 547 
Tierney, L., 247-248, 250, 553 
Tsiatis, A. A., 335, 553 

Tsui, K.-W., 273, 553 


Tu, D., 74, 175, 378, 382-383, 544, 
552 


Tukey, J., 376, 524, 544, 553 

van der Vaart, A. W., 49, 543, 553 
Vardi, Y., 329, 553 

von Bahr, B., 30, 547 




von Mises, R., 543-544, 553 
Wahba, G., 332, 554 

Wald, A., 433, 435, 543, 554 
Weerahandi, S., 128, 554 


Wellner, J. A., 49, 70, 320, 330, 
543-544, 552 


Welsh, A. H., 544, 554 
Wolfe, D. A., 543-544, 551 
Wolfowitz, J., 321, 324, 547, 549 
Woodruff, R. S., 502, 554 
Wretman, J., 195, 543, 552 

Wu, C. F. J., 379, 554 

Yahav, J. A., 297, 545 

Zeger, S. L., 362, 544, 547, 550 


Subject Index 


0-1 loss, 115 

χ²-statistic, 437-440 
χ²-test, 436, 437-440, 449 
λ-system, 74 

π-system, 74 


σ-field, 2; generated by a collection 
of sets, 2; generated by a measurable 
function, 7 


σ-finite measure, 5 
A 


Absolute continuity, 15-16 
Absolute error loss, 123, 155 
Absolute moment, 28 
Acceptance region, 477 
Action, 113; see also decision 
Action space, 113 


Admissibility, 116-117, 134, 152, 
264-266, 309-310; in simulta- 
neous estimation 269-273, 311- 
312; of Bayes rules and lim- 
its of Bayes rules, 240-241; of 
MLE’s, 279, 314; of minimax 
estimators, 261; of MRIE’s, 
255 




Almost everywhere (a.e.), 12 
Almost surely (a.s.), 12 
Alternative hypothesis, 115 


Analysis of variance (ANOVA), 27, 
185-186, 425-426, 520, 523-524 


Ancillary, 109, 112 
Approximate unbiasedness, 135 


Asymptotic accuracy: of confidence 
sets, 503-505, 510-518, 537- 
538; of point estimators, see 
asymptotic efficiency 


Asymptotic bias, 135-137, 204, 371; 
see also bias 


Asymptotic confidence set or inter- 
val, 495-503, 514 


Asymptotic correctness: of confi- 
dence sets, 495-503, 509-510; 
of simultaneous confidence in- 
tervals, 519; of tests, 140-141, 
452-454 


Asymptotic covariance matrix, 
286-287, 371-372, 452, 495-496 


Asymptotic criteria, 131-132 


Asymptotic efficiency, 138-139, 
286-289, 496; of Bayes estimators, 
297-298; of L-estimators 
in linear models, 359; of M-estimators, 
369; of MELE’s, 
326; of MLE’s and RLE’s, 290-295; 
of one-step MLE’s, 295-296; 
of sample mean, sample 
median, and trimmed sample 
mean, 355-357; of tests, 454 


Asymptotic expectation, 135-136 
Asymptotic inference, 139-142 


Asymptotic mean squared error 
(amse), 138-139; of functions 
of means, 204-207, 212, 286, 
371; of U-statistics 180-181; of 
UMVUE’s, 173 


Asymptotic normality, 70, 286; of 
Bayes estimators, 297-298; of 
empirical c.d.f., 320; of density 
estimators, 330-331; of func- 
tions of sample means, 101- 
102, 145, 204-206; of L- 
estimators, 357-359; of LSE’s, 
194-195; of M-estimators and 
GEE estimators, 367-371; of 
MELE’s, 326; of MLE’s and 
RLE’s, 290-295; of one-step 
MLE’s, 295-296; of rank tests, 
445-446; of sample quantiles, 
353; of sample moments, 207, 
210; of statistical functionals, 
338-339; of sum of independent 
random variables, 67-70; of U- 
statistics, 180; of V-statistics, 
212; of weighted LSE’s, 213 


Asymptotic pivotal quantity, 495 


Asymptotic optimality, 173, 286; of 
tests, 435 


Asymptotic relative efficiency, 138- 
139, 213, 215, 217, 262, 355- 
357, 369 




Asymptotic significance level: of 
confidence sets, 142, 495; of si- 
multaneous confidence inter- 
vals, 519-520; of tests, 140-141, 
428, 432, 434, 437-438, 442, 
450 


Asymptotic test, 433, 452-454, 497- 
498 


Asymptotic unbiasedness, 135, 161, 
204, 242, 297 


Asymptotic variance, 138-139, 286, 
371; see also asymptotic co- 
variance matrix 


Asymptotically pivotal quantity, 
495-497 


Automatic bootstrap percentile, 
539 


Autoregressive time series, 45-46, 
85, 285-286 


B 


Bahadur’s representation, 354, 501 
Baseline hazard function, 334 
Basu’s theorem, 112 


Bayes action, 233-235, 239, 267, 
440 


Bayes estimator, 239-245, 261-263, 
266-269, 279, 297-298 


Bayes factor, 440-441 

Bayes formula, 232-233 

Bayes risk, 120, 239-241, 263 
Bayes rule, 120-121, 231, 239-240 


Bayes test, 301, 440-441 




Bayesian approach, 122, 231-233, 
274, 440, 480-482 


Bayesian hypothesis testing, see 
Bayes tests 


Behrens-Fisher problem, 414 


Bernoulli variable, see binary ran- 
dom variable 


Best linear unbiased estimator 
(BLUE), 189-193, 204, 206, 
213, 215, 319 


Beta distribution, 20, 262, 414 


Bias, 119, 112-118, 210-211, 241, 
252-253, 256, 371; see also 
asymptotic bias 


Biased sampling, 328, 336 
Binary random variable, 122 
Binomial distribution, 18, 44-45 
Biostatistics, 329, 361 


Bivariate normal distribution, 416, 
474; see also multivariate nor- 
mal distribution and normal 
distribution 


Bonferroni’s method or intervals, 
519-520 


Bootstrap, 376, 380, 505-506 


Bootstrap accelerated bias-corrected 
(BC_a) percentile, 508, 
514, 538 


Bootstrap bias-corrected (BC) per- 
centile, 507-508, 513-515 


Bootstrap calibrating, 517-518 


Bootstrap confidence set or inter- 
val, 505-506 




Bootstrap data or sample, 380-381, 
506 


Bootstrap distribution estimator, 
382-383 


Bootstrap inverting, 516-517 


Bootstrap percentile, 506-507, 509, 
513-514 


Bootstrap prepivoting, 515-516 


Bootstrap sampling procedure, 
506, 539-540 


Bootstrap-t, 509, 511 
Bootstrap variance estimator, 381 
Bootstrapping pairs, 539 
Bootstrapping residuals, 539 
Borel-Cantelli lemma, 53 
Borel function, 7 
Borel σ-field, 2 
Bounded completeness, 110, 405 
Bounded in probability, 55 
Breslow’s estimator, 335 

C 
Cartesian product, 5 
Cauchy distribution, 20, 25 
Cauchy-Schwarz inequality, 29 
Censored data, 329, 333 
Censoring times, 329, 333 


Central limit theorem (CLT), 67, 
69-70 




Central moment, 28, 210 
Change of variables, 13 


Characteristic function (ch.f.), 20, 
33; properties of, 34-36 


Chebyshev’s inequality, 32 

Chi-square distribution, 20, 23, 25, 
27; see also noncentral chi- 
square distribution 

Cluster, 199, 281 

Cluster sampling, 199 


Cochran’s theorem, 27 


Comparison of two treatments, 408, 
413 


Completeness, 109-112, 162-166, 
174, 187, 196, 198, 267, 405 


Completeness of a class of decision 
rules, 152 


Composite hypothesis, 397 
Conditional distribution, 43 


Conditional expectation, 37; properties 
of, 37-40 


Conditional independence, 42 
Conditional likelihood, 284-285 
Conditional p.d.f., 39 


Conditional probability, 37; prop- 
erties of, 37-40 


Confidence band, 525-526 


Confidence bound, 129-130, 478- 
479, 488-490, 492-494, 503-518 




Confidence coefficient, 129-131, 
471, 477, 484; of simultaneous 
confidence intervals, 519 


Confidence interval, 129-130, 472- 
476, 478-479, 501-502; invari- 
ance of, 493-494; properties of, 
484-493, 502, 509-515; see also 
confidence bound and simulta- 
neous confidence intervals 


Confidence set, 122, 129-131, 142, 
471-474, 476-477, 495, 497- 
501, 505, 515-517; invariance 
of, 493-494; properties of, 484- 
493, 496, 503-505; see also con- 
fidence bound, confidence in- 
terval and simultaneous confi- 
dence intervals 


Conjugate prior, 235, 299 


Consistency: of point estimators, 
132-135, 504; of Bayes estimators, 
242-245, 297; of bootstrap 
estimators, 382-383; of empirical 
c.d.f.’s, 320-321; of GEE estimators, 
363-367; of jackknife 
estimators, 377-380; of LSE’s, 
193-194; of MLE’s and RLE’s, 
290-293; of moment estimators, 
207; of sample quantiles, 
352; of tests, 140-141, 435, 
452-454; of U-statistics, 177; of 
UMVUE’s, 172; of variance 
estimators, 181, 213, 215-217, 
372-380, 382-383, 452-454, 
505, 509 


Contingency table, 409-410, 439 
Continuous c.d.f., 9 
Contrast, 522-523 


Convergence almost surely, 50 




Convergence in distribution or in 
law, 50; properties of, 56-57 


Convergence in L_p, 50 
Convergence in probability, 50 
Convex function, 31, 80 
Convex set, 31 

Convolution, 341 


Cornish-Fisher expansion, 73, 503- 
505, 510-513, 517 


Correlation coefficient, 29 
Countable set, 4 
Counting measure, 3 
Covariance, 28-29 
Covariate, 182, 280 


Coverage probability, 129; conver- 
gence speed of, 514-515, 518; 
estimator of, 517 


Cramér’s continuity condition, 72 


Cramér-Rao lower bound, 169-173, 
186, 215 


Cramér-von Mises test, 448-449 
Cramér-Wold device, 56 

Credible interval, see credible set 
Credible set, 480-482, 487 

Critical region, see rejection region 
Cumulant generating function, 34 
Cumulants, 34 


Cumulative distribution function 
(c.d.f.), 4, 6 




Cumulative hazard function, 334 
D 


Data reuse method, see resampling 
method 


Decision, 113; see also action 
Decision rule, 113, 231, 239 
Decision theory, 91, 113 
Delta-method, 61 
DeMorgan’s law, 2 

Density estimation, 330-332 


Differentiability or differential of 
functionals, 338-339, 342 


Dirichlet distribution, 268 
Discrete c.d.f., 9 

Discrete p.d.f., 15-18 

Discrete random variable, 9 
Discrete uniform distribution, 18 
Dispersion measure, 123 
Dispersion parameter, 280-281 
Distance, 320; Mallows’, 322 
Distribution, 8 

Distribution-free test, 442 


Dominated convergence theorem, 
13, 40 


Dominated family, 94 
Doob’s decomposition, 49 
Double bootstrap, 516 


Double exponential distribution, 21 




Dunnett’s interval, 542 


Dvoretzky-Kiefer-Wolfowitz inequality, 
321 


E 


Edgeworth expansion, 71-73, 503- 
504, 510, 513, 517 


Egoroff’s theorem, 13, 76 
Empirical Bayes, 236-237, 269 


Empirical c.d.f., 123, 320; proper- 
ties of, 320-324 


Empirical likelihood, 323-324, 327- 
329, 332, 336-337, 362-363, 
449-451, 500-501 


Empirical likelihood ratio, 449-451 


Empirical likelihood ratio test, 449- 
451, 500 


Equal-tailed confidence interval, 
473 


Equal-tailed test, 412 
Equicontinuity, 364 


Equivariant estimator, 251; see also 
invariant estimator 


Estimability, 161, 184 
Estimation, 114 

Estimator, see point estimator 
Euclidean space, 2 

Event, 2 


Expectation or expected value, 11, 
28; see also mean 


Explanatory variable, see covariate 




Exponential distribution, 9, 20 


Exponential family, 96-97, 279; 
canonical form of, 96; full rank 
of, 97; natural parameter and 
natural parameter space of, 96; 
properties of, 97-99, 106, 109- 
110, 171, 265, 285, 292, 298, 
398, 400, 406-407, 429, 478 


External bootstrapping, 540 


F 


F-distribution, 21, 25; see also non- 
central F-distribution 


F-test, 413-414 

Factor, 425-426; interaction of, 426 
Factorial experiment, 186 
Factorization theorem, 104 
Fatou’s lemma, 13, 40 
Feller’s condition, 68 
Fieller’s interval, 474 
First-order accuracy, 503 
First-order ancillary, 109 
First-stage sampling, 202 
Fisher’s exact test, 410 


Fisher information, 169-171, 265, 
287-289, 295, 432-435, 498- 
499 


Fisher-scoring, 278, 283, 296 
Fréchet differentiability, 339, 342 
Frequentist approach, 231, 239 


Fubini’s theorem, 14 




G 


Gamma distribution, 20, 36 


Gateaux differentiability, 338, 374, 
378 


Gauss-Markov theorem, 189 


Generalized Bayes, 235-236, 239- 
240 


Generalized estimating equations 
(GEE), 359-363 


Generalized estimating equations 
(GEE) estimator, 360; asymp- 
totic normality of, 367-371; 
consistency of, 363-364, 366- 
367; variance estimation for, 
375-376, 379-380 


Generalized inverse, 183 


Generalized linear model (GLM), 
279-280, 292-293, 435 


Geometric distribution, 18 

Gibbs sampler, 247-248 

Gini’s mean difference, 175, 344 

Goodness of fit test, 437-439, 449 

Group of transformations, 119, 417 
H 


Hadamard differentiability, 338-339, 
342 


Hájek and Rényi’s inequality, 32 
Hazard function, 334 
Hierarchical Bayes, 237-239 


Highest posterior density (HPD), 
481, 487 




Histogram, 91, 331 


Hodges-Lehmann estimator, 349, 
453-454; asymptotic normality 
of, 351 


Hoeffding’s theorem, 176 
Hölder’s inequality, 29 


Horvitz-Thompson estimator, 199- 
201, 328 


Hybrid bootstrap, 508-509, 511- 
512, 514-515 


Hypergeometric distribution, 18, 
277, 409-410 


Hyperparameter, 236-238 


Hypothesis testing, 114, 122, 125, 
393 


I 
Importance function, 246 
Importance sampling, 245 
Improper prior, 235 


Independence, 22, 41; see also con- 
ditional independence and 
pairwise independence 


Independence chain, 250 


Independent and identically distri- 
buted (i.i.d.), 62 


Indicator function, 7 

Induced likelihood function, 312 
Induced measure, 8 

Inference, 122, 139 


Information inequality, 169-170, 
287 




Influence function, 339; in variance 
estimation, 374 


Integrability, 11 

Integral, 10 

Integration, 10; by parts, 77 
Interquartile range, 355 


Interval estimator or set estimator, 
122, 129; see also confidence 
interval or set 


Invariance, 119, 122; in linear mod- 
els, 260-261, 422-427; in loca- 
tion problems, 251-255; in 
location-scale problems, 257- 
260; in scale problems, 255- 
257; in testing hypothesis, 417- 
427; of confidence sets, 493-494 


Invariant decision problem, 119 
Invariant decision rule, 119 
Invariant estimator, 251 
Invariant family, 119 

Inverse gamma distribution, 299 
Inverse image, 6 

Inversion formula, 35 

Iterative bootstrap, 517 


J 


Jackknife, 175, 376-380 
Jacobian, 23 


James-Stein estimator, 269, 271- 
272 


Jensen’s inequality, 31, 118 




Joint c.d.f. or p.d.f., 6, 19 
K 


Kaplan-Meier estimator, 330 
Kernel, 174, 331 

Kernel density estimator, 330 
Kolmogorov’s inequality, 32 


Kolmogorov-Smirnov statistic, 447, 
526 


Kolmogorov-Smirnov test, 447-449 
Kronecker’s lemma, 62 


Kurtosis, 514 
L 


L-functional, 343-344 


L-estimator, 343-345, 351, 357-359, 
375 


L_p distance or norm, 322 


Lagrange multiplier, 324-325, 328, 
336 


Law, see distribution 


Law of large numbers, 62; see also 
strong law of large numbers 
and weak law of large numbers 


Least absolute deviation estimator, 
see minimum L_p distance estimator 


Least squares estimator (LSE), 182, 
282-283; asymptotic normality 
of, 194-195; consistency of, 
193-194; efficiency of, 186, 188, 
213-214, 261, 267, 272; in confidence 
sets, 474, 480, 490; in 
prediction, 483; in simultaneous 
confidence intervals, 521; 
in testing problems, 416, 424-426, 
430; inadmissibility of, 
272; invariance of, 260-261; robustness 
of, 189-193; unbiasedness 
of, 184; variance estimation 
for, 375, 378-379 


Lebesgue integral, 11 
Lebesgue measure, 3 

Lebesgue p.d.f., 22 
Lehmann-Scheffé theorem, 162 


Length of a confidence interval, 
130, 484-488, 502, 518, 525 


Level of significance: in hypothesis 
testing, 126, 140, 393; of a con- 
fidence set, 129, 142, 471; of a 
prediction set, 482; of simulta- 
neous confidence intervals, 519 


Lévy-Cramér continuity theorem, 
56 


Liapounov’s condition, 69 
Liapounov’s inequality, 30 
Life-time testing, 93, 123, 329 
Likelihood function, 274 
Likelihood equation, 274-275 


Likelihood methods, 273-274, 323- 
325, 333-337, 428, 433-434, 
449-450, 497-498, 


Likelihood ratio, 48, 397, 428, 449 


Likelihood ratio (LR) test, 428; 
asymptotic properties of, 432; 
in exponential families, 429- 
430; in normal linear models, 
430-431 




Limiting confidence coefficient, 
142, 495 


Limiting size, 140, 442 
Lindeberg’s condition, 67-68 


Linear function of order statistics, 
351-352; see also L-estimator 


Linear model, 182; Bayes estimators 
in, 244-245; BLUE’s in, 
189-190; confidence sets in, 
474, 480, 490, 496; invariance 
in, 260-261; L-estimators in, 
358-359; LR tests in, 430; 
LSE’s in, 182; M-estimators in, 
360; prediction sets in, 482-483; 
shrinkage estimators in, 
272; simultaneous confidence 
intervals in, 521-524; UMPI 
tests in, 422-427; UMPU tests 
in, 415-416; UMVUE’s in, 186, 
191; variance estimation in, 
375, 378-379; with random coefficients, 
191, 205 


Link function, 280 


Location family, 99, 251; confidence 
intervals in, 472-473; MRIE’s 
in, 252-255; invariance in, 120, 
251-252 


Location-scale family, 99, 257; 
Fisher information in, 170; 
MRIE’s in, 259-261; invariance 
in, 120, 258; UMPI tests in, 
419 


Log-distribution, 18 


Log-likelihood equation, see likeli- 
hood equation 


Log-normal distribution, 21 


Logistic distribution, 21 




Longitudinal data, 361 


Loss function, 113, 116; convexity 
of, 117, 233, 253, 256, 260, 264, 
267; invariance of, 119, 251, 
255, 258; see also absolute er- 
ror loss, squared error loss, and 
0-1 loss 


M 


M-functional, 345 


M-estimator, 346-348; asymptotic 
normality of, 367-369; consis- 
tency of, 363; in linear models, 
360 


Marcinkiewicz and Zygmund’s in- 
equality, 31 


Marginal c.d.f. or distribution, 6 
Marginal p.d.f., 22 


Markov chain, 45; properties of, 46- 
47, 246-247 


Markov chain Monte Carlo 
(MCMC), 245-250, 278 
Markov’s inequality, 32 
Martingale, 48; properties of, 49 
Maximal invariant, 418 


Maximum empirical likelihood esti- 
mator (MELE), 324 


Maximum likelihood estimator 
(MLE), 274; asymptotic efficiency 
of, 290-293; in confidence 
sets, 497-498; in GLM, 
281-282; in LR tests, 429 


Maximum profile likelihood estima- 
tor, 336 




Maximum quasi-likelihood estima- 
tor (MQLE), 284, 362 


Maximum likelihood method, 273- 
274 


Mean, 28; see also expectation and 
expected value 


Mean absolute error, 123 


Mean squared error (mse), 123; 
consistency in, 133 


Measurable function, 6 


Measurable space, 2 


Measure, 3; continuity of, 4; mono- 
tonicity of, 4; subadditivity of, 
4 


Measure space, 3 
Measurement problem, 92, 114 
Median, 91, 155 

Metric, see distance 
Metropolis algorithm, 249-250 
Minimal completeness, 152 


Minimal sufficiency, 107-108; in ex- 
ponential families, 109 


Minimax estimator, 261-264, 266, 
271 


Minimaxity, 120-121 


Minimum L_p distance estimator, 
346 


Minimum risk invariant estimator 
(MRIE), 252 


Minkowski’s inequality, 30 


Missing data, 337 




Mixture distribution, 278, 353 
Moment, 28; method of, 207, 237 
Moment estimator, 207-210 


Moment generating function 
(m.g.f.), 18, 20-21, 33; properties 
of, 33-36 


Monotone convergence theorem, 
13, 40 


Monotone likelihood ratio, 397-398; 
in exponential families, 298 


Monte Carlo, 245-246, 381, 506, 
516 


Multinomial distribution, 98; in χ²-tests, 
436, 438; in contingency 
tables, 410, 439 


Multiple comparison, 520, 523 
Multivariate CLT, 69 


Multivariate normal distribution or 
p.d.f., 19, 29, 79, 82; see 
also asymptotic normality, 
bivariate normal distribution, 
and normal distribution 


N 


Nearest neighbor method, 332 
Negative binomial distribution, 18 
Negative part of a function, 11 


Newton-Raphson method, 278, 
283, 295 
Neyman structure, 405 


Neyman-Pearson lemma, 289, 394, 
397 


Nominal level, 517 




Noncentral chi-square distribution, 
26-27, 81; see also chi-square 
distribution 


Noncentral F-distribution, 26-27, 
79; see also F-distribution 


Noncentral t-distribution, 26, 79; 
see also t-distribution 


Noncentrality parameter, 26 
Noninformative prior, 235 
Nonlinear regression, 283, 361 
Nonparametric family, 95 


Nonparametric likelihood function, 
323 


Nonparametric maximum likelihood 
estimator, 324 


Nonparametric method, 95 
Nonparametric model, 95 
Nonparametric test, 442 


Norm, 320 


Normal distribution or p.d.f., 19- 
20, 29, 79, 82; see also asymp- 
totic normality, bivariate nor- 
mal distribution, multivariate 
normal distribution, and stan- 
dard normal distribution 


Normalizing and variance stabiliz- 
ing transformation, 507 


Nuisance parameter, 280 


Null hypothesis, 115 
O 


One-sample problem, 411, 444 




One-sample t-test, 412 


One-sample Wilcoxon statistic, 
175 


One-sided confidence interval, see 
confidence bound 


One-sided hypothesis, 399 
One-step MLE, 295 
One-way ANOVA, 185 
Optimality in risk, 114 


Order statistics, 102; completeness 
of, 111-112; p.d.f. of, 102; sufficiency 
of, 106 


Outcome, 1 
Over-dispersion, 281 
P 


p-value, 127-128, 441 

Pairwise independence, 22 
Parameter, 94 

Parameter space, 94 
Parametric bootstrapping, 538 


Parametric family, 94; identifiabil- 
ity of, 94, 183 


Parametric method, 95 
Parametric model, 94, 231 
Pareto distribution, 21, 209 
Partial likelihoods, 333-334 
Partition, 8 


Permutation test, 443-444 




Pitman’s estimator, 253, 257; min- 
imaxity of, 264 


Pivotal quantity, 471, 483 
Point estimator, 122 

Point mass, 19 

Poisson distribution, 18 
Pélya’s theorem, 51 
Polynomial regression, 185, 205 
Population, 91 

Positive part of a function, 11 


Posterior distribution or p.d.f., 231- 
232; approximate normality of, 
297; computation of, 245 


Power function, 393 

Power series distribution, 143, 165 
Power set, 2 

Pratt’s theorem, 491 

Prediction, 40, 225, 482 
Prediction interval or set, 482-483 
Predictor, 40, 482 

Prior distribution or p.d.f., 231 


Probability density function 
(p.d.f.), 15 


Probability measure, 3 
Probability space, 3 
Product-limit estimator, 330 
Product measure, 5 


Product σ-field, 5 




Product space, 5 
Profile likelihood, 336 
Profile empirical likelihood, 336 


Profile empirical likelihood ratio 
test, 449-451 


Projection: on lower dimension 
spaces, 76; on random ele- 
ments, 178 


Projection matrix, 188, 415, 433, 
436-437 


Projection method, 178-180 
Proportional allocation, 199 
Proportional hazards model, 334 


Pseudo-likelihood equation, 362 


Q 
Quantile, 338, 351, 501 
Quasi-likelihood, 284, 361-362 
R 


R-estimator, 349-351 


Radon-Nikodym derivative or den- 
sity, 15; properties of, 16-17 


Radon-Nikodym theorem, 15 


Random censorship model, 329, 
333-334 


Random effects model, 192, 426- 
427 


Random element, 7 
Random experiment, 1 


Random variable or vector, 7 




Random walk chain, 250 


Randomized confidence set, 491- 
493 


Randomized decision rule, 116-117, 
233; risk of, 116 


Randomized estimator, 150-151 


Randomized test, 128, 393, 429, 
477, 491 


Rank, 348 

Rank statistics, 348, 444-445, 476 
Rank test, 396-397 

Rao’s score test, see score test 
Rao-Blackwell theorem, 117 
Ratio estimator, 204-205 
Regression M-estimator, 360 
Rejection region, 115 

Repeated measurements, 361 


Replication method, see resampling 
method 


Resampling method, 376 

Residual, 188; in L-estimators, 358 
Riemann integral, 11-12 

Risk, 113, 116 

Risk set, 334 


Robustness: in Hampel’s sense, 
340-341; of L-estimators, 345, 
359; of LSE’s, 189-190; of M- 
estimators, 347, 369; of R- 
estimators, 351; of rank statis- 
tics, 349; of sample mean, me- 
dian, and trimmed mean, 355- 
357 




Root of the likelihood equation 
(RLE), 290, 360; asymptotic 
efficiency of, 290-293 


S 
Sample, 92 
Sample central moment, 210 


Sample correlation coefficient, 145, 
417 


Sample covariance matrix, 373 


Sample mean, 92; admissibility of, 
241; asymptotic distribution 
of, 101-102; consistency of, 
133-134; efficiency of, 355-356; 
distribution of, 101, 112-113; 
minimaxity of, 121; moments 
of, 101; mse of, 114; optimality 
of, 118; robustness of, 355-356 


Sample median, 356 
Sample moment, 174, 207 


Sample quantile, 338; asymptotic 
distribution of, 353-355, 501; 
Bahadur’s representation for, 
354; consistency of, 352; distri- 
bution of, 352; see also sample 
median 


Sample size, 92 
Sample space, 1 
Sample standard deviation, 255 


Sample variance, 92; asymptotic 
distribution of, 101-102; consistency 
of, 133; distribution 
of, 101-102, 112-113; moments 
of, 101; see also sample covariance 
matrix 




Scale family, 99, 255-257 


Scheffé’s method or intervals, 520- 
522, 525 


Scheffé’s theorem, 59 
Score function, 292 


Score test, 434, 498; asymptotic 
distribution of, 434 


Scoring, 292 
Second-stage sampling, 202 


Semi-parametric method or model, 
333 


Shortest-length confidence interval, 
484-488 


Shrinkage estimator, 269, 271-273 
Sign test, 442-443 


Signed rank statistic, 348, 480; one- 
sample Wilcoxon’s, 348, 454 


Signed rank test, 444-446, 480; 
Wilcoxon’s, 444 


Significance level, see level of signif- 
icance 


Similar test, 404-405 

Simple function, 7 

Simple hypothesis, 394 
Simple linear regression, 185 
Simple random sampling, 93 


Simultaneous confidence intervals, 
519 


Simultaneous estimation, 267 


Single-stage sampling, 201 




Size, 126, 393; see also limiting size 
Skewness, 514 

Skorohod’s theorem, 51 

Slutsky’s theorem, 60 

Smoothing splines, 332 

Squared error loss, 114, 267 
Standard deviation, 28 


Standard normal distribution or 
p.d.f., 19; see also asymptotic 
normality and normal distribu- 
tion 


Statistic, 100; distribution of, 101- 
102 


Statistical computing, 245 


Statistical decision theory, see deci- 
sion theory 


Statistical functional, 338 
Statistical inference, see inference 
Statistical model, 94 

Stepwise c.d.f., 9 

Stochastic order, 55 

Stratified sampling, 197 

Strong law of large numbers, 62, 65 
Studentized random variable, 72 
Studentized range, 523 
Submartingale, see martingale 


Substitution, 207; in variance esti- 
mation, 372-376 


Sufficiency, 93; see also minimal 
sufficiency 




Sup-norm, 321 


Sup-norm or sup-norm distance, 
321 


Superefficiency, 289 
Supermartingale, see martingale 
Survey, 44, 93, 195, 327 
Survival analysis, 329, 333 
Survival data or times, 329, 333 
Survival distribution, 329 
Survival function, 334 


Symmetry: of c.d.f. or p.d.f., 25-26; 
of random variables, 25-26; of 
random vectors, 36 


Systematic sampling, 202-203 


T 


t-distribution, 21, 25; see also non- 
central t-distribution 


t-type confidence interval, 525 
Test, 115, 125, 393 


Testing independence, 410, 416, 
439 
Tightness, 56 
Transformation, 23, 59-61 


Trimmed sample mean, 344, 357, 
453 


Truncation family, 106 


Tukey’s method or intervals, 523- 
525 


Tukey’s model, 356 




Two-sample linear rank statistic, 
349 


Two-sample problem, 260, 413, 
444, 449 


Two-sample rank test, 445-446; 
Wilcoxon’s, 445 
Two-sample t-test, 415, 443, 445 
Two-sided hypothesis, 401 
Two-stage sampling, 199, 202 
Two-way additive model, 465 
Two-way ANOVA, 186 
Type I error, 125, 393 
Type II error, 125, 393 


U 


U-statistic, 174; asymptotic distribution 
of, 180; variance of, 176 


Unbiased confidence set, 490 
Unbiased estimator, 119, 161 
Unbiased test, 404 
Uncorrelated random variables, 29 
Uniform distribution, 9, 20 


Uniform integrability, 51; proper- 
ties of, 52, 86 


Uniformly minimum risk unbiased 
estimator, 162 


Uniformly minimum variance unbiased 
estimator (UMVUE), 161 


Uniformly most accurate (UMA) 
confidence set, 488-489 




Uniformly most accurate invariant 
(UMAI) confidence set, 493 


Uniformly most accurate unbiased 
(UMAU) confidence set, 490 


Uniformly most powerful invariant 
(UMPI) test, 417-418; in 
location-scale families, 419; in 
normal linear models, 422-427 


Uniformly most powerful (UMP) 
test, 394; in testing one-sided 
hypotheses in families with 
monotone likelihood ratio, 
399-401; in testing simple hy- 
potheses, 394; in testing two- 
sided hypotheses in exponen- 
tial families, 401-403 


Uniformly most powerful unbiased 
(UMPU) test, 404; in compari- 
son of two treatments with dis- 
crete data, 408-409; in contin- 
gency tables, 409-410; in ex- 
ponential families, 406-408; in 
normal families, 410-417; in 
normal linear models, 415-416; 
in one-sample problems, 411- 
412; in testing for indepen- 
dence in normal families, 416- 
417; in two-sample problems, 
413-415 


Unimodality, 485 


Uniqueness: of Bayes action or 
estimator, 233, 240; of distribution 
with a given ch.f., 35; of 
measure, 75; of minimax estimator, 
261; of MRIE, 253, 256; 
of product measure, 5; of 
Radon-Nikodym derivative, 
15; of UMP test, 394; of 
UMVUE, 162 




V 


V-statistic, 210, 342, 448; asymp- 
totic distribution of, 212; bias 
of, 211; variance of, 211 


Variance, 18, 20-21, 28 
Variance estimation, 371-372 


Variance estimator, 175, 201, 215- 
217, 373-376; see also boot- 
strap variance estimator and 
jackknife 


Vector, 7 
Volume of a confidence set, 490-491 
W 


Wald’s test, 433-434, 497-498; asymptotic 
distribution of, 434 


Watson-Royall theorem, 196 




Weak convergence, see convergence 
in distribution 


Weak law of large numbers 
(WLLN), 62, 65 
Weibull distribution, 21 


Weighted jackknife variance esti- 
mator, 379 


Weighted least squares estimator, 
213-215 


Wild bootstrapping, 540 
Winsorized sample mean, 346 
With replacement, 142, 327 


Without replacement, 93, 197, 199, 
327 


Working correlation matrix, 362 


Woodruff’s interval, 502 


Springer Texts in Statistics (continued from page ii) 


Madansky: Prescriptions for Working Statisticians 

McPherson: Applying and Interpreting Statistics: A Comprehensive Guide, 
Second Edition 

Mueller: Basic Principles of Structural Equation Modeling: An Introduction to 
LISREL and EQS 

Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume I: 
Probability for Statistics 

Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume II: 
Statistical Inference 

Noether: Introduction to Statistics: The Nonparametric Way 

Nolan and Speed: Stat Labs: Mathematical Statistics Through Applications 

Peters: Counting for Something: Statistical Principles and Personalities 

Pfeiffer: Probability for Applications 

Pitman: Probability 

Rawlings, Pantula and Dickey: Applied Regression Analysis 

Robert: The Bayesian Choice: From Decision-Theoretic Foundations to 
Computational Implementation, Second Edition 

Robert and Casella: Monte Carlo Statistical Methods 

Rose and Smith: Mathematical Statistics with Mathematica 

Santner and Duffy: The Statistical Analysis of Discrete Data 

Saville and Wood: Statistical Methods: The Geometric Approach 

Sen and Srivastava: Regression Analysis: Theory, Methods, and Applications 

Shao: Mathematical Statistics, Second Edition 

Shorack: Probability for Statisticians 

Shumway and Stoffer: Time Series Analysis and Its Applications 

Simonoff: Analyzing Categorical Data 

Terrell: Mathematical Statistics: A Unified Introduction 

Timm: Applied Multivariate Analysis 

Toutenburg: Statistical Analysis of Designed Experiments, Second Edition 

Whittle: Probability via Expectation, Fourth Edition 

Zacks: Introduction to Reliability Analysis: Probability Models and Statistical 
Methods 


